Getting Started with R

Ever wanted to try R, but weren't sure where to start? Or perhaps you are unsure of what R is and what it can do? This guide is designed to get you started with R and teach you the basics - showing you where to obtain R and introducing you to the R environment. It gives a quick overview of data structures and basic commands, and shows you where to get more help to develop your R skills.

The goal is to introduce you to the R environment, so you can become familiar with R, and can then start to experiment and perform your own data analysis with confidence.




! This guide was written using R version 3.4.2 on Windows 10.

So, what is R and why use it? The official R website defines R as a language and environment for statistical computing and graphics - it can be used to perform a wide range of statistical tests and almost any kind of data analysis, produce high quality graphics and figures, and can also be used as a geographic information system (GIS) to produce maps and carry out spatial analysis. It can read just about any kind of data format, and most graphics formats, so you can easily import and start working with your existing data.

It is similar to other statistical packages such as MATLAB, SPSS, Excel and Minitab, yet it is completely free and open-source software, so it won’t cost you a penny!

R is cross-platform, so whatever system you are running, you can download and use R – even on a Raspberry Pi.

R is highly extensible and there are thousands of "packages" available which can extend the core R functionality to perform just about anything you can imagine. Or, you can write your own functions.

R is simple and easy to use (really!), and if you get stuck there is a large community out there to help out.

© 2018 Benjamin Bell. All Rights Reserved. http://www.benjaminbell.co.uk

Downloading and installing R

R can be downloaded for free from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/, which is linked from the official R website https://www.r-project.org/ (available for Windows, Mac OS X, Linux and others). The default R environment allows you to run your R code and produce high quality figures with ease. It does not include a source code editor, so you would need to use a separate program to write and edit your code - and then run it through the R console.

I write all of my R code using Notepad++, which has code syntax highlighting and other useful features (such as the ability to edit multiple lines of code at once). Notepad++ is only available for Windows systems.

If you use another operating system, similar editors are available, some of which are part of the OS. I also recommend Visual Studio Code, which can be installed on Windows, Mac OS X and Linux. And it's free!

You could also choose to use an Integrated Development Environment (IDE) instead, such as RStudio. This includes many additional features, and includes a source code editor. It doesn't matter which editor or program you decide to use, as the code is run through the R console, and the results are the same.

Once you have downloaded, installed and opened R, you will be presented with the R console, where you can input and execute your R code.

The R console can be customised to change fonts and colours to suit you. For example, you may prefer to use a darker background and lighter text.


The format of these guides

These guides are designed to take you through working examples step by step. As R is frequently updated, and data sources can also be updated regularly, each guide states which version of R was used when making the guide. It will also state the version of source data used, if applicable. Newer versions of R, or of packages, should maintain compatibility with past versions, but it's possible that things can change. For example:

! This guide was written using R version 3.4.2 on Windows 10.

If the guide is updated (e.g. to correct any errors), this will also be indicated at the start of the guide, with a changelog at the bottom of the guide. e.g.

! This guide was updated on 06/01/2018.

R code examples will appear in a box as below, with syntax highlighting and numbered lines to make it easy to read, while inline code examples e.g. plot(x, y) will appear within the main text.

# Example code, as it will appear in these guides
x <- c(5, 6, 4, 2, 3, 1)
y <- c(7, 1, 3, 1, 6, 4)

plot(x, y)

R console examples will appear in a box as below, with commands indicated by the > indicator at the start of the line. All R commands are written on one line – if the command is not complete, the next line will start with a + symbol to indicate a continuation of the previous command.

> x <- 5
> y <- 5
> x + y
 [1] 10

Outputs from these commands are printed on a new line which does not start with a > indicator.
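As a small illustration of the continuation prompt described above, the sketch below splits one command over two lines. The object name z is made up for this example. If typed into the console, the second line would be shown with a + at the start instead of a >.

```r
# A command left unfinished at the end of a line (here, an open bracket)
# continues on the next line; the console shows + until it is complete.
z <- c(1, 2, 3,
       4, 5, 6)

# The completed command created a single vector with six values
length(z)
```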


Using R

R is an interpreted language, and the R console presents you with a command prompt (>) where you can enter commands one at a time. R is case sensitive, so R and r are different. R commands generally consist of objects and functions – an object can be anything you create, and can later manipulate with different functions. There are various different data types in R (discussed later in this guide) which can be created and assigned to objects. The default assignment operator in R is <-

Confused? These examples should clear things up.

Firstly, in the R console you are going to create a new numeric vector object named x and assign it a value of 5.

> x <- 5

The above example shows the R console, with the command prompt (>) followed by the object (x), the assignment operator (<-) and the value (5) assigned to the object.

Next, if you had two vector objects and wanted to know their combined value, you could run the following commands:

> x <- 5
> y <- 5
> x + y

Here, you have created two objects x and y, and then issued a command to add the values of these objects together: x + y. R then prints the result in the R console.

> x <- 5
> y <- 5
> x + y
 [1] 10
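Because R is case sensitive, object names that differ only in capitalisation refer to different objects. The following is a minimal sketch of that behaviour; the name my.value is hypothetical and chosen only for this example.

```r
# Create an object with a lower-case name
my.value <- 5

# exists() reports whether an object with exactly this name has been created
found.lower <- exists("my.value")   # the name as it was typed
found.mixed <- exists("My.Value")   # same letters, different capitalisation

found.lower
found.mixed
```

Only the first check finds the object, because My.Value was never created.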

The next example demonstrates some of the basic functionality which R can be used for.

Let's say you have two sets of data and want to perform some simple analysis and plot the results. Firstly, you would input your data into R. In this case, create two vector objects with a series of values.

> x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)
> y <- c(3, 3, 7, 6, 1, 1, 2, 3, 4, 3)

Here, you have created two numeric vector objects, each of which has multiple values, created using the combine function c(). You might then want to perform some simple descriptive statistics, such as finding the minimum min(), maximum max() and mean mean() values, or the standard deviation sd(), for your data.

> x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)
> y <- c(3, 3, 7, 6, 1, 1, 2, 3, 4, 3)
> min(x)
 [1] 2
> max(x)
 [1] 10
> mean(x)
 [1] 5.3
> sd(x)
 [1] 2.710064
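A few other base R functions follow the same pattern. The sketch below is not part of the original walkthrough; it simply applies median(), range(), length() and summary() to the same x vector, written as a script rather than as console input.

```r
x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)

median(x)    # middle value of the sorted data
range(x)     # minimum and maximum together
length(x)    # number of values in the vector
summary(x)   # minimum, quartiles, median, mean and maximum in one call
```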

Now, you might want to run linear regression on your data. The linear regression command in R is lm(), but you need to assign the linear regression command to an object to easily see the output of the model.

> xy.reg <- lm(x ~ y)

Here, you have created a new object (xy.reg) which contains the linear regression model for your data (x versus y). But, you want to know the results of the model - so, in the R console you would use the summary() command, which will print out the results:

> xy.reg <- lm(x ~ y)
> summary(xy.reg)

Call:
lm(formula = x ~ y)
        
Residuals:
     Min       1Q   Median       3Q      Max 
-1.61877 -0.76540 -0.05865  0.98460  1.20821 
        
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.0323     0.6894   1.497 0.172697    
y             1.2933     0.1823   7.094 0.000103 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
        
Residual standard error: 1.065 on 8 degrees of freedom
Multiple R-squared:  0.8628,    Adjusted R-squared:  0.8457 
F-statistic: 50.32 on 1 and 8 DF,  p-value: 0.0001026      

summary() gives you the key information relating to your model, including the R-squared value, the adjusted R-squared value (useful for multiple regression), the F-statistic, and the p-value, which indicates whether your model is statistically significant or not.
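If you only need individual numbers rather than the full printout, they can be pulled out of the model and its summary. This is a minimal sketch using the same data; the object name model.summary is an assumption made for this example.

```r
x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)
y <- c(3, 3, 7, 6, 1, 1, 2, 3, 4, 3)
xy.reg <- lm(x ~ y)

# Intercept and slope of the fitted model
coef(xy.reg)

# Store the summary so that parts of it can be extracted by name
model.summary <- summary(xy.reg)
model.summary$r.squared       # R-squared
model.summary$adj.r.squared   # adjusted R-squared
model.summary$coefficients    # the coefficients table as a matrix
```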

Now that you have your results, you may want to visualise the data. R has some really powerful features for visualising data, but for now we will stick with the basics.

To plot your results, use the plot() command, which will give you a nice figure showing your data:

> plot(x, y)

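This results in a simple scatter plot of y against x.

You could then add a linear regression trendline to your plot using the abline() command in the R console. Note that lm(y ~ x) is used here so that the line matches the axes of the plot (y as a function of x); this is a different model from xy.reg above, which fitted x as a function of y: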

> abline(lm(y ~ x))

You have created a simple plot, using the default options in R. You might want to change some of the colours, or add a title to your figure. Plots are totally customisable in R, with the ability to change just about everything. Commands such as abline() and title() entered after plot() will add to the plot currently displayed, while calling plot() again starts a fresh plot. You can also start a new plot by closing the displayed plot figure, or by using the plot.new() command in the R console.

The following code will change the colours of the plot and trendline, and add a title to your figure.

plot(x, y, col="blue")
abline(lm(y ~ x), col="red")
title(main="My First Plot")

To save a plot, select the plot window in the R environment, then use the menu: File > Save as > ...

Alternatively, you can write the plot directly to a file from the R console by preceding plot() with a command that tells R to create an image file, for example png() for a PNG image, and then using a final command dev.off() to tell R that you have finished creating the image. The resulting plot will not display in the R environment and may look slightly different from the onscreen version.

png("your_filename_here.png")
plot(x, y)
dev.off()
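The same open, plot, close pattern works for other file types. The sketch below uses pdf() instead of png(), which is a substitution made for this example, and writes to a temporary file so that nothing is overwritten. The file location and the width and height values are assumptions, not settings from the original guide.

```r
x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)
y <- c(3, 3, 7, 6, 1, 1, 2, 3, 4, 3)

# Hypothetical output location: a temporary file
plot.file <- tempfile(fileext = ".pdf")

pdf(plot.file, width = 6, height = 6)   # open the file; size is in inches
plot(x, y, col = "blue")
abline(lm(y ~ x), col = "red")
title(main = "My First Plot")
dev.off()                               # finish and close the file

# Confirm that the file was written
file.exists(plot.file)
```

Everything drawn between opening the file and dev.off() goes into the file rather than onto the screen.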

Future guides will explore plotting and the creation of figures in more detail.



Quick overview of R data structures

R has several ways in which it can hold data, including (but not limited to) vectors, matrices, data frames, arrays and lists. A comprehensive guide to all the data structures in R and how they can be used can be found in the R documentation. This guide will give a quick overview of some of the common data structures you are likely to use often. There are many options for manipulating data in these data structures which are not covered by this guide; refer to the R documentation and tutorials linked later in the guide for details.

Vector

Vectors are the simplest data structures in R. They are one-dimensional, and can hold either numeric, character or logical data. You cannot mix different types of data in the same vector object; if you try, R converts all of the values to a single common type. You can add multiple data values using the combine c() function.

Numeric data consists of numbers separated by a comma. Numeric data does not need to be enclosed in quotation marks.

site <- c(1, 2, 3, 4, 5, 6)

Character data consists of characters, or strings of text; for example, it could be a series of sample site names. Character data needs to be enclosed in quotation marks.

site.name <- c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F")

Logical vectors contain logical constants, e.g. TRUE or FALSE.

site.visited <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

You can check whether an object is a vector using the is.vector() command. You could also check whether the vector object is numeric is.numeric() or character is.character() or logical is.logical() in the R console. For example:

> is.vector(site)
 [1] TRUE   
> is.numeric(site)
 [1] TRUE   
> is.character(site)
 [1] FALSE   
> is.logical(site)
 [1] FALSE
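To see why one vector cannot hold several types of data, the sketch below deliberately mixes a number, a string and a logical value. The object name mixed is made up for this example.

```r
# Mixing types does not cause an error; R converts every value to one type
mixed <- c(1, "Site A", TRUE)

class(mixed)          # reports the type the vector ended up with
is.character(mixed)   # the number and the logical value have become text
mixed
```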

You can reference data in a vector by using its numeric position within the vector object. For example, to find the fifth value of the vector site.name, input the following command into the R console:

> site.name[5]
 [1] "Site E"
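Square brackets can also select several values at once. The following sketch shows a few common forms using the vectors from this section; the names of the result objects are assumptions made for illustration.

```r
site.name <- c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F")
site.visited <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

first.three   <- site.name[1:3]            # a range of positions
picked        <- site.name[c(1, 5)]        # specific positions
without.first <- site.name[-1]             # everything except position 1
visited       <- site.name[site.visited]   # values where the logical vector is TRUE

visited
```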

Matrix

A matrix is a two-dimensional array which can contain only one type of data (numeric, character or logical). Matrices can be created using the matrix() function, the result of which you assign to an object.

For example, let’s create a matrix containing the site names from the vector object site.name:

# Create vector of site names
site.name <- c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F")
# Create matrix from the vector
mat <- matrix(site.name, ncol=3)

This creates a matrix with 3 columns from the data contained within the vector. By default, R fills in data by column.

> mat
     [,1]     [,2]     [,3]    
[1,] "Site A" "Site C" "Site E"
[2,] "Site B" "Site D" "Site F"

To fill in the matrix by row, add the option byrow=TRUE to the matrix command:

mat <- matrix(site.name, ncol=3, byrow=TRUE)

Which will result in the following matrix:

> mat
     [,1]     [,2]     [,3]    
[1,] "Site A" "Site B" "Site C"
[2,] "Site D" "Site E" "Site F"
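Values in a matrix are referenced by row and column, in that order, inside square brackets. This is a minimal sketch using the matrix filled by row; the names first.row and second.col are hypothetical.

```r
site.name <- c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F")
mat <- matrix(site.name, ncol = 3, byrow = TRUE)

dim(mat)                 # number of rows, then number of columns
mat[2, 3]                # a single value: row 2, column 3
first.row  <- mat[1, ]   # leave the column blank to get a whole row
second.col <- mat[, 2]   # leave the row blank to get a whole column
```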

Data Frame

A data frame is similar to a matrix, but it can contain a mix of data types for different columns. A data frame is similar to a spreadsheet you might use in Excel or other software.

One way to create a data frame is to pass vector objects to the data.frame() function. For example, let's create a new data frame object from the three vectors used in the vector examples above.

df <- data.frame(site, site.name, site.visited)

Which would result in the following data frame object:

> df
  site site.name site.visited
1    1    Site A         TRUE
2    2    Site B         TRUE
3    3    Site C         TRUE
4    4    Site D        FALSE
5    5    Site E        FALSE
6    6    Site F        FALSE
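Once the data frame exists, its columns and rows can be picked out in a similar way to vectors and matrices. The sketch below is an illustration only. It adds stringsAsFactors = FALSE, which is not in the call above, so that the site names are stored as plain text whichever version of R is used.

```r
site <- c(1, 2, 3, 4, 5, 6)
site.name <- c("Site A", "Site B", "Site C", "Site D", "Site E", "Site F")
site.visited <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# stringsAsFactors = FALSE is an assumption added for this example
df <- data.frame(site, site.name, site.visited, stringsAsFactors = FALSE)

df$site.name                     # one column, selected by name
df[2, ]                          # one row, selected by position
visited <- df[df$site.visited, ] # only the rows where site.visited is TRUE
str(df)                          # compact overview of columns and their types
```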

If you are using the default R environment, you can edit the data within your data frame without having to recreate it, by using the fix() command, which will result in the data editor window appearing. In here you can select any cell to correct any errors or change the data.

> fix(df)

If working with large datasets, it can often be easier to edit and manage your data using spreadsheet software - and then import this dataset into R. This is an extensive topic, beyond the scope of this introduction. Full details about importing data can be found in the R documentation.
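As a small taste of importing, the sketch below writes a data frame to a CSV file and reads it back with read.csv(). The temporary file stands in for a file exported from a spreadsheet, and the object names csv.file and imported are assumptions made for this example.

```r
# A small data frame to stand in for spreadsheet data
df <- data.frame(site = c(1, 2, 3),
                 site.name = c("Site A", "Site B", "Site C"),
                 site.visited = c(TRUE, TRUE, FALSE))

# Hypothetical file location: a temporary CSV file
csv.file <- tempfile(fileext = ".csv")
write.csv(df, csv.file, row.names = FALSE)

# Import the CSV file into a new data frame
imported <- read.csv(csv.file, stringsAsFactors = FALSE)
imported
```

With your own data, the temporary file would be replaced by the path to your CSV file.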


A quick note on #comments

When writing lots of code (for any programming/scripting language) it is always a good idea to document what you are doing. Ideally, your code should be clear - but when you start to write lots of code, and start having lots of different functions, it can become confusing. Perhaps you might revisit your code at a later date - will you know what your code does without comments?

Commenting your code is always a good idea, to explain what the commands or functions are doing and/or how they are doing it. For your own sanity, I advise writing plenty of comments throughout your R code to keep it clear, so that you and others can follow it easily.
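In R, anything after a # on a line is treated as a comment and is ignored when the code runs. The following short sketch shows one possible commenting style, reusing the regression example from earlier in this guide; the object name slope is made up for illustration.

```r
# Example data: two sets of measurements (values from earlier in this guide)
x <- c(6, 4, 9, 10, 3, 2, 2, 5, 6, 6)
y <- c(3, 3, 7, 6, 1, 1, 2, 3, 4, 3)

# Fit a linear regression of x on y and store the model for later use
xy.reg <- lm(x ~ y)

# Extract the slope, to report how much x changes per unit of y
slope <- coef(xy.reg)[["y"]]
slope   # a comment can also follow code on the same line
```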

A great blog post on the good, the bad, and the ugly when it comes to adding comments to your code can be found here


Where to get more help

Hopefully this guide will have given you a good introduction to the basics of R and made you familiar with the R environment. If you want to explore R further, there are plenty of resources available to help you out.

If, like me, you like to have a good book to refer to, then I highly recommend R in Action by Robert I. Kabacoff. This book will tell you everything you need to know about using R, starting with an introduction to the basics and moving on to advanced techniques. It is clearly written and easy to follow, and serves as a useful reference book.


There are loads of free guides online which give an expanded introduction to R, and the best place to start is the official R introduction.

If you are ever stuck on how to do something, Stack Overflow is a great resource for solving coding problems. It can almost be guaranteed that your question already has a solution on Stack Overflow.

You can also get plenty of help within R itself. Do you want to know what a particular command does, or what options it has? Refer to the command's documentation using ? followed by the command, e.g. ?plot will load the help page for the plot() command, which explains how to use it, along with all the options.
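A few other built-in functions can help when you do not yet know which help page you need. The sketch below is an illustration using base R functions; the object name matches is made up for this example.

```r
# List the names of available objects and functions that contain "mean"
matches <- apropos("mean")
matches

# Show the arguments that a function accepts
args(sd)

# Run the worked examples from the bottom of a help page
example("mean")
```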

Thanks for reading this introduction to R! Please leave any comments below.



Further reading

A quick guide to pch symbols - A quick guide to the different pch symbols which are available in R, and how to use them. [R Graphics]

A quick guide to line types (lty) - A quick guide to the different line types available in R, and how to use them. [R Graphics]

Extracting CRU climate data - A 4 part guide series which shows you how to download climate data and analyse it with R.

Pollen diagrams using rioja - Part 1 of a 3 part guide series where I show you how to plot pollen diagrams using rioja.

Principal components analysis (PCA) in R - A guide showing you how to perform PCA in R, and how to create great looking biplots.

