Benjamin Bell: Blog: Getting Started with R: Introduction to R

Getting Started with R: Introduction to R

by Ben on Saturday, January 06, 2018

Ever wanted to try R, but weren't sure where to start? Or perhaps you are unsure of what R is and what it can do? This guide is designed to get you started with R and teach you the basics - showing you where to obtain R and introducing you to the R environment. It gives a quick overview of data structures and basic commands, and shows you where to get more help to develop your R skills.

The goal is to introduce you to the R environment, so you can become familiar with R, and can then start to experiment and perform your own data analysis with confidence.

NEW updated guide for 2021!

Guide Information

Title	Getting started with R: Introduction to R
Author	Benjamin Bell
Published	January 06, 2018
Last updated	August 24, 2021
R version	3.4.2
Packages	base
Navigation	What is R? | Downloading and installing R | Using R | R Working directory | Objects | Functions and arguments | Operators | Comments | Data structures (Vector, Matrix, Data Frame, List) | Selecting data | Subsetting data | Where to get more help and the next steps | Further reading

This guide has been updated:

This guide has been rewritten and expanded into a 3 part guide that covers the basics of R: the language, its features, and how to get started.

This is a 3 part guide:

Part 1: Introduction to R	An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data.
Part 2: Importing data into R	This guide shows you how to import your existing data into R from .csv or Excel files.
Part 3: An overview of graphics in R	This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots.

What is R?

R is a programming language for statistical computing and producing graphics. It can be used for a wide range of statistical techniques (e.g. linear and non-linear modelling, classical statistical tests, time-series analysis, classification and more), for data analysis and data science, and for producing high quality graphics and figures. R can also be used as a geographic information system (GIS) to produce maps and carry out spatial analysis.

R can read just about any kind of data or graphics format, so you can easily import and start working with your existing data.

R is an interpreted language. R code and commands are read and executed by an "interpreter" on the fly, rather than being compiled into a separate program first.

R comprises a "base" installation which contains default functions for performing analysis. R is highly extensible, and the base installation can be extended through "packages" which offer new functionality. Many of these packages are themselves written in R, but can also be written in other languages (e.g. C, C++). As well as extending functionality through packages, users can also write their own functions within R.

R is free and open source software and cross-platform. R is based on the S programming language. For more information about R, you can visit the official website.

Downloading and installing R

R is available for many systems including Windows, MacOS and Linux. You can download R through "CRAN", the Comprehensive R Archive Network. Select a mirror closest to your location, and then click on the "Download R for ..." link depending on which OS you are running. You should then click "base" if you are installing R for the first time.

For your convenience, you can use the following links to download the latest version of R: Windows, MacOS, and Linux.

Once downloaded, run the installation file and follow the on-screen instructions, and R should be ready to use.

The default R environment consists of a terminal to run R code (known as the R console). This differs depending on the operating system you use.

[Screenshot: the R console on Windows]

You enter your R code into the console, hit enter, and it is executed. R code and commands are executed one line at a time.

The default R environment does not include a source code editor, and since writing R code directly into the R console is not an ideal way to work, you may wish to use an Integrated Development Environment (IDE) instead, such as RStudio, which you need to download separately.

There are also many other source code editors and IDEs which you can use to write R code, for example: Visual Studio Code which is available on all platforms.

Using R

R is an interpreted language, and the R console presents you with a command prompt (>) where you can enter commands one at a time. R is also case sensitive, so R and r are different.
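As a small illustration of case sensitivity, the sketch below creates two objects whose names differ only by a capital letter. The names value and Value are made up for this example.

```r
# Two separate objects: the names differ only by case
value <- 1
Value <- 2

value   # 1
Value   # 2
```

Typing a name with the wrong capitalisation is a common cause of "object not found" errors when you are starting out.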

R commands can consist of objects, functions, arguments and operators, but just before delving into that, it is a good idea to set up your R environment, starting with your working directory, before beginning any R project.

R Working directory

You should now have R installed, and be using either the default R environment with a text/code editor, or an IDE like RStudio.

It is recommended to keep your R projects in separate folders or directories. You should create the folder (in the usual way for your OS/filesystem) and then set the working directory to that folder in R.

For example, if your working folder was: /Users/ben/R/project1 you would use the following code within R to change to this folder:

setwd("/Users/ben/R/project1")

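For Windows systems, you should use a forward slash (/) when typing file paths in the R console, instead of the usual back slash (\). A single back slash is treated as an escape character inside quotation marks, so if you do want to use back slashes you must double them (\\).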

If you want to find out what working directory you are currently in, type getwd() into the R console.
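Here is a minimal sketch of changing the working directory and then changing back. It assumes a temporary folder (from tempdir()) as a stand-in for a real project folder, and the names old_wd, project_dir and current_wd are made up for this example. The file.path() function builds the path using forward slashes, so the same code works on Windows, MacOS and Linux.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# A temporary folder is used here only as a stand-in for a real project folder
project_dir <- file.path(tempdir(), "project1")
dir.create(project_dir, showWarnings = FALSE)

# Change to the project folder and check where we are
setwd(project_dir)
current_wd <- getwd()
basename(current_wd)   # "project1"

# Switch back to the original working directory
setwd(old_wd)
```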

To save your R session on Windows systems, select File and then Save Workspace on the menu bar, which will save your session as a .RData file.

MacOS users can select Workspace and then Save Workspace File... on the menu bar.

Alternatively, you can use the following code in the R console to save the session (Linux users must save their R sessions in this way):

save.image("filename.RData")

This will save the current R session, using the filename specified, in your current working directory. For further options on saving data, type ?save into the R console.

When you quit R, you will also be prompted to save the current R session.

To load a previously saved R session, the easiest way is to click (or double click) on the saved file, which should open up R (this will depend on your operating system). Or, Windows users can select File and then Load Workspace on the menu bar within R. And MacOS users can select Workspace and then Load Workspace File... on the menu bar.

Alternatively, use the following code in the R console to open a previously saved R session (Linux users should load an R session in this way):

# First set the working directory to where you saved your R session
setwd("/Users/ben/R/project1")
# Load the R session
load("filename.RData")

When you re-open your saved R session, all the objects previously created will be loaded. You can use ls() to see a list of all the objects in the session.
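To see saving and loading in action without touching your own files, the following sketch saves two objects to a temporary file, removes them, and loads them back. It uses save() with named objects rather than save.image(), and the file name and object names (session_file, my_scores, my_label) are made up for this example.

```r
# A temporary file stands in for "filename.RData" in your project folder
session_file <- file.path(tempdir(), "example_session.RData")

# Create two objects and save only these two to the file
my_scores <- c(2, 4, 6, 8)
my_label <- "demo"
save(my_scores, my_label, file = session_file)

# Remove the objects from the current session
rm(my_scores, my_label)

# Load the file; load() returns the names of the objects it restored
restored <- load(session_file)
restored    # "my_scores" "my_label"
my_scores   # 2 4 6 8
```

save.image() works the same way, but saves every object in the session rather than the ones you name.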

By saving your R code separately (using your code editor), you will always be able to recreate your R session and data, even if you did not save it within R.

Objects

Everything in R is an object. You can create objects by assigning values, data and functions to them. There are various different data types in R (discussed later in this guide) which can be created and assigned to objects. You assign values to objects using the assignment operator <-.

Here's a simple example:

> my_object <- 5

The above example shows the R console, with the command prompt (>), where we have created an object (my_object) using the assignment operator (<-), and given it a value (5). The created object is a numeric vector. Vectors are the simplest data structures in R.
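If you want to check what kind of object you have created, a few built-in functions will tell you. This is a short sketch using the same my_object as above:

```r
my_object <- 5

class(my_object)       # "numeric"
length(my_object)      # 1
is.vector(my_object)   # TRUE

# Objects can be used in calculations just like the values they hold
my_object * 2          # 10
```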

Functions and arguments

Functions are objects that perform a task. Functions may contain arguments, which you can use to tell R how to perform the task, changing the way it works - think of them as options. R contains many built-in functions, or you can write your own functions.

Here's some example code:

# Create some data
x <- c(2, 4, 6, 8)

In the above code, we have created a numeric vector (x) by using a built-in function c() which combines different elements to form a vector.

If we wanted to know the mean of x, we could use the mean() function. We could just type this into the R console without assigning it to an object to get the result:

> mean(x)
[1] 5

Alternatively, we could assign the result of the function to an object:

# Create some data
x <- c(2, 4, 6, 8)
# Mean of x
m <- mean(x)

To get the result now, we simply input "m" into the R console:

> m
[1] 5

Let's consider another function to create a sequence of numbers. To create the sequence using the seq() function, we will have to assign values to the function arguments:

# Create a sequence of numbers
my_sequence <- seq(from = 0, to = 50, by = 10)

If you typed "my_sequence" into the R console, you would get the following result:

> my_sequence
[1]  0 10 20 30 40 50

The values that you assign to function arguments can also be objects, for example:

# Objects
start <- 0
end <- 50
inc <- 10
# Create a sequence of numbers
my_sequence <- seq(from = start, to = end, by = inc)

Typing my_sequence into the R console would give the same result as before. However, one of the benefits of using objects for function arguments is that you can change the values of the objects without having to rewrite the function code.
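As mentioned at the start of this section, you can also write your own functions. The sketch below defines a small function with one required argument and one optional argument. The name range_width is made up for this example and is not a built-in R function.

```r
# A hypothetical helper: the difference between the largest and smallest value
# na.rm is an optional argument with a default value of FALSE
range_width <- function(x, na.rm = FALSE) {
  max(x, na.rm = na.rm) - min(x, na.rm = na.rm)
}

x <- c(2, 4, 6, 8)
range_width(x)                        # 6

# Use the optional argument to ignore missing values
range_width(c(x, NA), na.rm = TRUE)   # 6
```

Giving an argument a default value means you only need to specify it when you want to change how the function behaves.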

If you ever get stuck with what something does, it is easy to call up the help pages using the "?" command. For example, to call up the help page for the mean() function, type the following into the R console:

> ?mean

Operators

Operators perform tasks in R, such as arithmetic, comparing values (relational operators), or carrying out Boolean operations (logical operators). For example, you might use an arithmetic operator to calculate values:

> 8 * 10
[1] 80

Or, you might want to use a relational operator to compare values:

> 8 * 10 == 80
[1] TRUE

You can see a list of all the operators in R's help pages:

# Arithmetic operators
?Arithmetic
# Relational operators
?Comparison
# Logical operators
?base::Logic
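Operators also work on whole vectors at once, one element at a time. Here is a short sketch of each kind of operator applied to the vector x used earlier:

```r
x <- c(2, 4, 6, 8)

# Arithmetic operators
x + 1             # 3 5 7 9
x %% 4            # 2 0 2 0 (remainder after dividing by 4)

# Relational operators
x > 4             # FALSE FALSE  TRUE  TRUE

# Logical operators: AND (&), OR (|), NOT (!)
x > 2 & x < 8     # FALSE  TRUE  TRUE FALSE
x == 2 | x == 8   #  TRUE FALSE FALSE  TRUE
!(x > 4)          #  TRUE  TRUE FALSE FALSE
```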

Comments

In the above code examples, you will notice many lines start with a "#" symbol. This tells R to ignore everything after the "#" on that line (i.e. not to execute it as code) as it is a comment.

When writing lots of code (for any programming/scripting language) it is always a good idea to document what you are doing. Ideally, your code should be clear - but when you start to write lots of code, and start having lots of different functions, it can become confusing. Perhaps you might revisit your code at a later date - will you know what your code does without comments?

Commenting on your code is always a good idea to explain what the commands or functions are doing and/or how they are doing it. For your own sanity I advise writing lots of comments throughout your R code to keep it clear, so that you and others can follow it easily.

A great blog post on the good, the bad, and the ugly when it comes to adding comments to your code can be found here

Data structures

R has several data structure types for storing and manipulating data, including vectors, matrices, data frames, arrays and lists. This section provides an introduction to the most common data structures you are likely to use. For a full and comprehensive overview of the data structures in R, take a look at the official documentation.

Vector

Vectors are the simplest data structure in R. They are one-dimensional and hold either numeric, character or logical data. You cannot mix different data types in the same vector object; if you try, R converts all the values to a single type. To create a vector containing multiple values, you use the combine function c().

# Numeric vector
num <- 10
# Character vector
char <- "a"
# Logical vector
log <- TRUE

# Vectors with multiple values
# Numeric
nums <- c(10, 20, 30, 40, 50, 60)
# Character
chars <- c("a", "b", "c", "d", "e", "f")
# Logical
logs <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

For character data, you must enclose the values within quotation marks. This is not necessary for numeric or logical data.
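To see what happens when you do try to mix data types in a vector, consider this short sketch. The object names mixed and num_lgl are made up for this example.

```r
# Mixing numeric, character and logical values: everything becomes character
mixed <- c(10, "a", TRUE)
mixed            # "10"   "a"    "TRUE"
class(mixed)     # "character"

# Mixing numeric and logical values: TRUE becomes 1
num_lgl <- c(10, TRUE)
num_lgl          # 10  1
class(num_lgl)   # "numeric"
```

If a vector of numbers unexpectedly prints with quotation marks, a stray character value is usually the cause.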

Matrix

A matrix is a two-dimensional array, which like vectors can only contain one type of data. You can create a matrix using the matrix() function, either by specifying the data within the function, or by specifying a vector or vectors. Consider the following examples:

# Create a simple matrix with 3 columns
mat1 <- matrix(c("a", "b", "c", "d", "e", "f"), ncol=3)
# Create a matrix from a vector with 3 columns
mat2 <- matrix(chars, ncol=3)
# Create a matrix from 2 vectors with 2 columns
mat3 <- matrix(c(chars, chars), ncol=2)

In the first example, the matrix was created by specifying the data to be included within the matrix() function, while the second example told R to get the data from the chars vector. The third example tells R to get the data from two vectors (in this case they are both the same). You should specify either the ncol or nrow argument to create a matrix (you can also specify both); if you specify neither, R creates a matrix with a single column.

The resulting matrix would look like this (for the first and second example):

     [,1] [,2] [,3]
[1,] "a"  "c"  "e"
[2,] "b"  "d"  "f"

And for the third example:

     [,1] [,2]
[1,] "a"  "a"
[2,] "b"  "b"
[3,] "c"  "c"
[4,] "d"  "d"
[5,] "e"  "e"
[6,] "f"  "f"

You'll notice that matrices are created in column order. To change this, add the argument byrow=TRUE to the code. For example:

# Create a matrix from a vector with 3 columns, with data entered by row
mat4 <- matrix(chars, ncol=3, byrow=TRUE)

The resulting matrix would now look like this:

     [,1] [,2] [,3]
[1,] "a"  "b"  "c"
[2,] "d"  "e"  "f"
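You can check the size of a matrix with dim(), nrow() and ncol(). The sketch below also shows the nrow argument, and what happens if neither nrow nor ncol is given. The object names m, m_t and m_col are made up for this example.

```r
chars <- c("a", "b", "c", "d", "e", "f")

# Specify the number of rows instead of the number of columns
m <- matrix(chars, nrow = 2)
dim(m)       # 2 3 (rows, columns)
nrow(m)      # 2
ncol(m)      # 3

# t() transposes a matrix, swapping rows and columns
m_t <- t(m)
dim(m_t)     # 3 2

# With neither nrow nor ncol, the result is a single column
m_col <- matrix(chars)
dim(m_col)   # 6 1
```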

Data Frame

A data frame is similar to a matrix, but it can contain a mix of data types. It is like a spreadsheet you might use in Microsoft Excel or Google Sheets.

You can create a data frame using the data.frame() function, in a similar fashion to creating a matrix. Consider the following examples:

# Create a data frame from vectors
df1 <- data.frame(nums, chars, logs) # Example 1
# Create a data frame by specifying data
df2 <- data.frame(col1=c(1,2,3,4), col2=c("a", "b", "c", "d"), col3="Name") # Example 2

The first data frame would look like this:

  nums chars  logs
1   10     a  TRUE
2   20     b FALSE
3   30     c  TRUE
4   40     d FALSE
5   50     e  TRUE
6   60     f FALSE

And the second data frame would look like this:

  col1 col2 col3
1    1    a Name
2    2    b Name
3    3    c Name
4    4    d Name

The data frame takes the column names either from the name of the vector (first example), or the specified name (second example). You'll notice in the second example, data was specified for each row for columns 1 and 2, but only one value was specified for column 3, so this value is repeated to fill all the rows.

You can also specify row names by adding the row.names argument to the function, which can either refer to a vector of character strings, or you can type them after the argument. For example:

# Add row names
df3 <- data.frame(col1=c("a", "b"), col2=c(1, 2), row.names=c("row1", "row2"))
# Add row names using a vector of character strings
mynames <- c("row1", "row2")
df4 <- data.frame(col1=c("a", "b"), col2=c(1, 2), row.names=mynames)

Both examples would result in the same data frame:

     col1 col2
row1    a    1
row2    b    2

For matrices and data frames it is generally good practice to organise them with columns as the variables and rows as the observations. Many functions will make this assumption when handling data from matrices and data frames.
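Once you have a data frame, a few built-in functions help you inspect it, and the $ operator lets you pull out a column by name. This sketch recreates df1 from the vectors used earlier. The stringsAsFactors = FALSE argument is my addition: in R versions before 4.0.0, data.frame() converts character columns to factors by default, and setting this argument keeps them as character data in any version.

```r
nums <- c(10, 20, 30, 40, 50, 60)
chars <- c("a", "b", "c", "d", "e", "f")
logs <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
df1 <- data.frame(nums, chars, logs, stringsAsFactors = FALSE)

dim(df1)             # 6 3 (rows, columns)
names(df1)           # "nums"  "chars" "logs"

# The data type of each column
sapply(df1, class)   # numeric, character, logical

# Select a column by name and use it in a calculation
df1$nums             # 10 20 30 40 50 60
mean(df1$nums)       # 35
```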

List

A list is a collection of objects and each list item can be any type of object. For example, you could have a list of several vectors, or you could have a list of vectors and data frames, or any other combination you can think of. Lists are extremely useful for storing and manipulating data, especially when used with loops and functions.

You can create a simple list using the following code:

# Create a list of vectors
mylist1 <- list(nums, chars, logs)

Which would look like this in the R console:

> mylist1
[[1]]
[1] 10 20 30 40 50 60

[[2]]
[1] "a" "b" "c" "d" "e" "f"

[[3]]
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

You can also specify names for each list item (either on creation, or after):

# Create a list of vectors with names
mylist1 <- list(Numbers=nums, Characters=chars, Logical=logs)

# or Add names to an existing list
names(mylist1) <- c("Numbers", "Characters", "Logical")

The list would now look like this in the R console:

> mylist1
$Numbers
[1] 10 20 30 40 50 60

$Characters
[1] "a" "b" "c" "d" "e" "f"

$Logical
[1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

Selecting data

R is really powerful when it comes to selecting data within a data structure.

Data stored within a vector (or any other data structure) can be accessed by reference to its index (position) within that vector. Each piece of data within the vector can be considered an "element". For the nums vector (created earlier), the values 10, 20, 30, 40, 50, 60 would correspond to an index value of 1, 2, 3, 4, 5 and 6 respectively.

You can access specific data within a vector based on the index using square brackets [ ]. Consider the following code, based on the example vectors created earlier:

# Select the first value in the nums vector 
nums[1] 
# Select the second value
nums[2]
# Select a range of values
nums[1:3]
# Select multiple values
nums[c(1, 3, 5)]

In these examples, you select the data from the nums vector by enclosing the index value within the square brackets. You can select single values, a range of values or multiple values.

You could use this to perform analysis or calculations on specific data, for example:

> nums[1] * nums[5]
[1] 500

You can select data from a matrix in much the same way as you can from a vector, but now the index would refer to the index value of the cell. Consider the following examples:

# Create a matrix
mat1 <- matrix(1:6, ncol=3)
# Create a matrix, but order by rows
mat2 <- matrix(1:6, ncol=3, byrow=TRUE)

The first matrix will look like this in the R console. Column and row numbers are shown in square brackets, and because the matrix was filled with the numbers 1 to 6, each value is also the index of its cell.

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

The second matrix will look like this in the R console. Notice the values of the cells have been re-ordered.

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

You can also select the data from the entire column or row. For example, to select the first row from the matrix we created earlier, put a comma after the row number e.g. mat1[1,], or second row mat1[2,] and so on. Columns would be selected by putting a comma before the column number: mat1[,1] or mat1[,2].

This will also work for a data frame. To select a specific cell from a data frame, you can first select the column or row, and then the position within it. To select the first cell from the first column of the data frame df1 that we created earlier, you would use df1[,1][1], or to select the fifth cell from the first column, use df1[,1][5]. To select the third cell from the first row use df1[1,][3], or the third cell from the fourth row use df1[4,][3]. You can also give the row and column numbers together, separated by a comma: df1[4,3] selects the cell in the fourth row and third column.
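Here is a short sketch of selecting cells from df1 by giving the row and column together inside one pair of square brackets, and by using column names. As before, stringsAsFactors = FALSE is my addition so that the chars column is character data in any R version.

```r
nums <- c(10, 20, 30, 40, 50, 60)
chars <- c("a", "b", "c", "d", "e", "f")
logs <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
df1 <- data.frame(nums, chars, logs, stringsAsFactors = FALSE)

# Row 5, column 1
df1[5, 1]         # 50

# Row 4, column 3
df1[4, 3]         # FALSE

# Columns can also be referred to by name
df1[1, "chars"]   # "a"
df1$nums[5]       # 50
```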

To select data from a list, you also need to specify the list item number using double square brackets [[ ]].

To select a specific element from an object within a list item, you must specify both the list index, and the element index value.

# Select the first list item
mylist1[[1]]
# Select the first element from the first list item
mylist1[[1]][1]

The results would show as follows in the R console:

> mylist1[[1]]
[1] 10 20 30 40 50 60
> mylist1[[1]][1]
[1] 10
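If your list items have names, you can also select them by name rather than by number. This sketch uses the named version of mylist1 created earlier:

```r
nums <- c(10, 20, 30, 40, 50, 60)
chars <- c("a", "b", "c", "d", "e", "f")
logs <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
mylist1 <- list(Numbers = nums, Characters = chars, Logical = logs)

# Select a list item by name using $ or double square brackets
mylist1$Numbers              # 10 20 30 40 50 60
mylist1[["Characters"]][2]   # "b"

# Single square brackets return a smaller list, not the item itself
class(mylist1["Numbers"])    # "list"
class(mylist1[["Numbers"]])  # "numeric"
```

The difference between single and double square brackets matters when you pass the result to a function such as mean(), which needs the vector itself rather than a list containing it.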

If you had stored a matrix in a list, you could also select specific rows, columns or cells:

# Create a list of matrices
mylist2 <- list(mat1, mat2)

# Select the first row from the first list item
mylist2[[1]][1,]
# Select the third cell from the first list item
mylist2[[1]][3]
# Select the third value of the first row from the first list item
mylist2[[1]][1,][3]

You'll get the following results in the R console:

> mylist2[[1]]
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> mylist2[[1]][1,]
[1] 1 3 5
> mylist2[[1]][3]
[1] 3
> mylist2[[1]][1,][3]
[1] 5

Notice that when selecting the third cell from the matrix, R returned the value "3", while you may have been expecting "5". This is because the index values are created in column order.
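The following sketch compares selecting by a single index with selecting by row and column, for both of the matrices created earlier:

```r
# Filled by column (the default)
mat1 <- matrix(1:6, ncol = 3)
mat1[3]      # 3 (third cell, counting down the columns)
mat1[1, 3]   # 5 (first row, third column)

# Filled by row
mat2 <- matrix(1:6, ncol = 3, byrow = TRUE)
mat2[3]      # 2 (cells are still counted down the columns)
mat2[1, 3]   # 3 (first row, third column)
```

The byrow argument only changes how the data is filled in. A single index always counts down the columns, so giving both the row and column number is usually the clearer choice.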

Subsetting data

Subsetting data works in a similar way to selecting data by using square brackets. You can subset data within an existing data structure, or create a new object using the subsetted data from another object. Consider the following examples using the "chars" vector created earlier:

# Create a new vector containing the first 3 elements
new1 <- chars[1:3]
# Create a new vector, by removing the first 2 elements
new2 <- chars[-c(1,2)]
# Alter an existing vector
chars <- chars[c(1,3,5)]

This would produce the following results in the R console:

> chars
[1] "a" "b" "c" "d" "e" "f"
> new1
[1] "a" "b" "c"
> new2
[1] "c" "d" "e" "f"
> chars
[1] "a" "c" "e"

New vectors were created either by telling R which elements to include, or telling R which elements to remove by using the minus symbol.

Subsetting data becomes really powerful by using conditional arguments to select (and subset) data that matches defined conditions. You can do this using the subset() function, for example (using the "nums" vector from earlier):

# Subset data with a value greater than 20
subset(nums, nums > 20)
# Subset data with a value equal to 20
subset(nums, nums == 20)
# Subset data with a value greater than or equal to 20
subset(nums, nums >= 20)

# Create a new vector with the subsetted data
newdata <- subset(nums, nums >= 20)

Which would produce the following results:

> subset(nums, nums > 20)
[1] 30 40 50 60
> subset(nums, nums == 20)
[1] 20
> subset(nums, nums >= 20)
[1] 20 30 40 50 60
> newdata
[1] 20 30 40 50 60

In this example, a new vector object newdata is created which contains only the data matching the arguments of the subset() function. The arguments specify the object to be subsetted (in this case, the vector nums), and how to subset the object (in this case, by values which are equal to or greater than 20 within the nums vector). Let's break down the code to make this easy to understand:

# The parts of the line of code below, from left to right:
#   newdata    the new vector object to be created
#   subset()   the function to subset the data
#   nums       first argument: the object to be subsetted
#   nums       second argument: using data within this object
#   >=         relational operator used to subset the data (greater than or equal to)
#   20         by this value
newdata <- subset(nums, nums >= 20)
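The condition inside subset() is simply a logical vector, with one TRUE or FALSE for each element. As a sketch of what is going on, you can create that logical vector yourself and use it inside square brackets to get the same result. The names keep and newdata2 are made up for this example.

```r
nums <- c(10, 20, 30, 40, 50, 60)

# The condition gives one TRUE or FALSE per element
keep <- nums >= 20
keep   # FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Use the logical vector inside square brackets to keep the TRUE elements
newdata2 <- nums[keep]
newdata2   # 20 30 40 50 60

# The result is the same as using subset()
identical(newdata2, subset(nums, nums >= 20))   # TRUE
```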

Subsetting is really powerful when used with a matrix or data frame. Consider the following examples, using the data frame from earlier (df1):

> df1
  nums chars  logs
1   10     a  TRUE
2   20     b FALSE
3   30     c  TRUE
4   40     d FALSE
5   50     e  TRUE
6   60     f FALSE

Let's say you wanted to select rows where logs is TRUE, or rows where nums was equal to or greater than 40. You could use the following code:

# Subset data from data frame which matches conditions
# Example 1
newdf1 <- subset(df1, logs == TRUE) 
# Example 2
newdf2 <- subset(df1, nums >= 40)

In the first example, a new data frame, newdf1, is created, which contains subsetted data from df1. The data it contains matches the condition set in the subset() function - that the new data frame should contain data where the logs column is equal to TRUE.

> newdf1
  nums chars logs
1   10     a TRUE
3   30     c TRUE
5   50     e TRUE

In the second example, a new data frame, newdf2, is created, which also contains subsetted data from df1 that matches the condition set within the subset() function. This time, the condition is that the new data frame should contain data where the nums column is greater than or equal to 40.

> newdf2
  nums chars  logs
4   40     d FALSE
5   50     e  TRUE
6   60     f FALSE

You could also combine both conditions into a single function call:

newdf3 <- subset(df1, nums >= 40 & logs == TRUE)

Now, R will keep only the rows that match both conditions, because the & operator requires the first condition and the second condition to be TRUE for the same row. This would result in the following new data frame:

> newdf3
  nums chars logs
5   50     e TRUE
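Two variations you may find useful are sketched below: using the | operator to keep rows that match either condition, and using the select argument of subset() to keep only certain columns. The names either and cols are made up for this example, and stringsAsFactors = FALSE is my addition so the chars column is character data in any R version.

```r
nums <- c(10, 20, 30, 40, 50, 60)
chars <- c("a", "b", "c", "d", "e", "f")
logs <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
df1 <- data.frame(nums, chars, logs, stringsAsFactors = FALSE)

# Keep rows where EITHER condition is TRUE
either <- subset(df1, nums >= 40 | logs == TRUE)
rownames(either)   # "1" "3" "4" "5" "6"

# Keep rows matching a condition, but only the nums and chars columns
cols <- subset(df1, nums >= 40, select = c(nums, chars))
names(cols)        # "nums"  "chars"
nrow(cols)         # 3
```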

Where to get more help and the next steps

Hopefully this guide will have given you a good introduction to the basics of R and made you familiar with the R environment. This guide only touches on what is possible, and if you want to explore R further, there are plenty of resources available to help you out (including guides on this blog!).

If, like me, you like to have a good book to refer to, then I can highly recommend R In Action by Robert I. Kabacoff. This book will tell you everything you need to know about using R, starting with an introduction to the basics and moving on to advanced techniques. It is clearly written and easy to follow. Although it's a little old now, it is still a useful reference.

Other books that may be useful:

An introduction to statistics and how to use them with R. Nice and easy to follow if you're not sure which stats to use.

An introduction to R for Spatial Analysis and Mapping - Useful book for getting started using R for spatial analysis, mapping, and GIS.

As well as books, there are loads of free guides online which give an expanded introduction to R, and the best place to start is the official R introduction.

If you are ever stuck on how to do something, Stack Overflow is a great resource for solving coding problems. It can almost be guaranteed that your question already has a solution on Stack Overflow.

You can also get plenty of help within R itself. This is probably the best source of help to explain any R function. Just use the help command: ? followed by the function name e.g. ?mean to load the help page for the mean() function.

The next steps you might want to explore are importing data, and producing graphics.

This introduction has focussed on examples where you create your own data. But, it is likely that you already have data in a spreadsheet that you want to manipulate in R, so how do you get this data into R? Check out part 2 of this guide to find out how.

As well as data manipulation and analysis, R has extremely powerful plotting functions for producing high quality graphics. Check out part 3 of this guide for an overview.

Thanks for reading this guide and please leave any comments below.