Pages (Desktop)

Pages (Mobile)

Getting started with R: Importing data into R

In the second part of this expanded guide series introducing you to R, this guide shows you how to import data into R from csv files and Excel spreadsheets.

Of course, R can import many other data types, but these two are very common and will get you started!

Guide Information

Title Getting started with R: Importing data into R
Author Benjamin Bell
Published August 23, 2021
Last updated
R version 4.1.1
Packages base; openxlsx

This is a 3 part guide:

Part 1: Introduction to R An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data.
Part 2: Importing data into R This guide shows you how to import your existing data into R from .csv or Excel files.
Part 3: An overview of graphics in R This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots.

R and data formats

If you followed part 1 of this guide, you would have created objects with new data. But, it is likely you have existing data that you would like to import into R. Luckily, R can read almost any kind of data format, either via base functionality or through additional packages.

These data formats may include plain text files, fixed-width files, comma-separated value (csv) files, spreadsheets (e.g. Excel), other statistical package file formats (e.g. Minitab, SPSS), database files (e.g. SQL), netCDF and HDF5 files, image files, pdf, HTML and XML files.

A complete guide for importing and exporting data for loads of file formats is available on the official website. Even if you are using an "obscure" data format, chances are that a package exists which will allow you to import the data into R.

Since trying to cover every data format would be near impossible, this guide covers two main data formats: The "universal" comma-separated values (csv) file format - showing you how to save your data to csv and import it into R; and Excel files - showing you how to import them into R.

Comma-separated values (csv) files

A comma-separated values (csv) file stores tabular data in plain text, which are usually separated by commas (In Windows, a csv file would have the .csv extension). Consider a spreadsheet, each cell stores a value or observation, in a csv file, it is these that are separated by commas. The files may contain any kind of data (numeric, character, logical).

Although there is no official "standard" for csv files, the format is widely supported across different operating systems and different software, including R.

Therefore, the easiest way to get your data into R, is to first export it to a csv file.

Exporting your data to csv

You'll only be able to save the active worksheet to a csv file. If your Excel workbook has multiple worksheets, you'll need to save these separately.

Importing a csv file into R

There are several options for importing the data contained within a csv or delimited file into R, for example: read.table(), read.csv(), or read.delim() functions can be used. See ?read.table for full details.

The default arguments applied to these functions varies depending on the function used, otherwise they can all be used for importing data.

For this guide, we'll focus on using read.csv() and read.csv2(). IMPORTANT! You should use read.csv() if your csv file is delimited by comma's (,) and the decimal points are periods (.). If your csv file is delimited by semi-colon's (;) and decimal points are comma's, you should instead use read.csv2() to import your file.

For these examples, we'll import a csv file named "holiday.csv", which contains weather data for July for three cities and qualitative cost information.

place temp precipitation cost
Barcelona 23 13.1 Medium
Manchester 20 56.8 Low
New York 25 104.6 High

To import this into R, you could use the following code:

# Import csv file into R
holiday <- read.csv("holiday.csv")

This code will import the data contained in the csv file named "holiday.csv" and create a new data frame object named holiday. The csv file should be in the same working directory as your R project, otherwise you would also need to specify the complete file path.

Not included in this code is the default arguments, as these will apply automatically each time you use read.csv(), unless you specify them differently. These arguments include:

header
This is a logical argument to specify whether the first line of the file contains the names of the variables. The default value is TRUE
sep
This argument allows you to specify the character to use as the field separator.

The default is "," for read.csv() and ";" for read.csv2()
dec
This argument allows you to specify the character to use as the decimal point.

The default is "." for read.csv() and "," for read.csv2()
fill
This is a logical argument to specify where rows have unequal lengths, whether blank fields should be created. The default is TRUE

The csv file that you import may already contain row names and column names, or you may wish to create these names when you import the data, so you could use the following arguments:

row.names
This argument allows you to specify the row names to use for the imported data. You can either specify a column containing the names (either using its name, or the column number), or you can specify a vector of names.
col.names
This argument allows you to specify the column names to use for the imported data. You can either specify a row containing the names (either using its name, or the row number), or you can specify a vector of names.

Consider the following examples:

# Import csv file into R
holiday <- read.csv("holiday.csv", row.names=1)

This code would tell R that the first column of the csv file should become the row names.

Alternatively, you could specify the name of the column to use as row names:

# Import csv file into R
holiday <- read.csv("holiday.csv", row.names="place")

Both examples would create the same data frame since the first column is named "place".

You can also create a vector of names, and use this as either the row names or column names.

Here's an example to specify column names using the col.names argument. Important! If your csv file already contains column names, they will be overwritten by the names specified in the vector. If your csv file does not contain column names you should add header=FALSE to your code:

# Create column names
mynames <- c("Location", "Temperature (C)", "Rainfall (mm)", "Cost") 
# Import data
holiday2 <- read.csv("holiday.csv", col.names=mynames)

In this example, the column names from the csv file are replaced by those in the mynames vector. This example code does not specify the row names, so R will instead use default values (numbered rows).

Lets compare the two data frames:

> holiday
           temp rainfall   cost
Barcelona    23     13.1 Medium
Manchester   20     56.8    Low
New York     25    104.6   High
> holiday2
    Location Temperature..C. Rainfall..mm.   Cost
1  Barcelona              23          13.1 Medium
2 Manchester              20          56.8    Low
3   New York              25         104.6   High

In the first data frame, the place names are used as the row names, while in the second data, the place names are their own column named "Location", while the row names are now default values.

You'll notice that in the second data frame the column names have changed to the ones specified in the mynames vector. However, there is a problem as the brackets (and space between the brackets) have been replaced by a period. This is because R performs a check on the names to make sure they are "syntactically valid" and will rename them if necessary. You can override this check by adding the argument check.names=FALSE. Although, it is usually a good idea to use syntactically valid names.

Additional arguments for importing a csv

The above example code is often all you'll need to use when importing data into R. Of course, this depends on the data you are importing, and so you may need to consider some additional arguments:

na.strings
This argument allows you to specify the value or character which denotes a blank field, and which will be converted to NA values on import.

The default is "" (i.e. blank field)

For example, if your csv file used a different value to specify a blank field, such as -9999, you would use na.strings=-9999. You can also specify multiple values to represent blank fields:

data <- read.csv("filename.csv", na.strings=c(-9999, "", "na"))
nrows
This argument allows you to specify how many rows of the csv file to import. By default, it will import all rows contained in the file.

If you only wanted to import a limited number of rows, e.g. the first 100 rows, add nrows=100 to your code.
skip
This argument allows you to specify how many rows of the csv file to skip when importing the file (i.e. these data rows will not be imported).

If you wanted to skip the first 100 rows, add skip=100 to your code.

You could also combine these arguments to skip a number of rows, and then only import a specified number of rows (after the skipped rows). Consider the following example using the "holiday.csv" file:

holiday.s <- read.csv("holiday.csv", col.names=mynames, row.names=1, skip=1, nrows=1)

The resulting data frame would now only include the "Manchester" row. This is because we have told R to skip the first row, which was "Barcelona". Then we told R to only import the first row (after the skipped row), which would now be "Manchester". Consequently the last row of the csv file ("New York") is not imported.

Technically, the csv file contains 4 rows, but the first row is the header, and when header=TRUE, this row is automatically skipped. Important! You should always specify the column names when skipping rows.

blank.lines.skip
This is a logical argument to specify whether blank rows in your csv file should be automatically skipped when imported into R. The default is TRUE

Data classes

R will assign a "class" to the columns based on the data they contain when the csv file is imported. By default, character data will be converted to "factors", and numbers will be converted to numerical data classes.

Consider the first "holiday.csv" example, the "temp" and "rainfall" columns contain discrete numeric data. The numeric data in the "temp" column are whole numbers, so it is converted to the class "integer", while "rainfall" is classed as "numeric" since it contains numbers with decimal places (this may also be referred to as "double" class). You don't really need to worry about numeric data classes, R will automatically convert between them.

The "cost" column contains categorical data, which R has assigned as a "factor". To check the classes of all the columns in the holiday data frame, use the following code:

sapply(holiday, class)

Which will result in the following output:

     temp  rainfall      cost 
"integer" "numeric"  "factor" 

For the next example, two additional columns have been added to the csv file:

place temp precipitation cost visited notes
Barcelona 23 13.1 Medium TRUE Nice beaches
Manchester 20 56.8 Low TRUE Rainy!
New York 25 104.6 High FALSE Fun

If you now import this csv file into R and check the classes that have been assigned, you'll get the following output:

     temp  rainfall      cost   visited     notes 
"integer" "numeric"  "factor" "logical"  "factor"

Since the "visited" column contains logical data, it has been assigned the logical class. The "notes" column has been assigned as a factor, however, this column contains character data that is not categorical.

You could add the argument stringsAsFactors=FALSE to your code to stop character data becoming factor classes. However, this would change all the columns with character data, and we want to keep the "cost" column as a factor.

Another way to override the default class assignments, is to specify them when you import the data using the colClasses argument. For example:

# Define classes (including for the row names)
class <- c("character","integer","numeric","factor","logical","character")
# Import data
holiday3 <- read.csv("holiday.csv", row.names=1, colClasses=class)

Now the classes will be assigned exactly how we want. If you wanted to assign the columns as the same class or different classes, you could of course do this by adjusting the code.

When specifying classes for your data, you should also specify a class for the column containing the row names otherwise you'll get errors.

Generally, you shouldn't need to worry about classes when importing data, but if importing very large datasets with many data types, it may become more important.

Excel files

You can import Excel files directly into R using a package. Several packages are available which can read (and write) Excel files. This guide will show you how to import Excel files using the "openxlsx" package, since it is fast and easy to use.

To download and open the package, use the following code in R (you only have to download the package once).

# Download package (do this only once)
install.packages("openxlsx")
# Run the package (do this each time you start a new R session)
library("openxlsx")

Once the package is installed and loaded, you can use the read.xlsx() function to import Excel files into R. For these examples, we'll use an Excel file named "holiday.xlsx" which contains the same data as the csv file from the previous example.

To import the Excel file, use the following code:

holiday <- read.xlsx("holiday.xlsx")

This code will create a new data frame object holiday from the Excel file named "holiday.xlsx". The Excel file should be in the same working directory as your R project, otherwise you will need to specify the complete file path.

There are lots of arguments (options) available when using this function, including the following default arguments:

sheet
This argument determines which Excel worksheet to import. You can either reference the sheet by its index (number), or using the name of the worksheet. The default value is 1 which means it will import the first worksheet, unless you specify otherwise.
startRow
This argument tells R which row in the Excel file to start importing data from. The default value is 1 which means it will import data from the first row. The function will always skip blank rows at the top of an Excel spreadsheet.
colNames
This is a logical argument to determine whether or not to use the first row of data in the Excel file for column names. The default value is TRUE

When you import a csv file, you are able to specify a vector for column names. However, for Excel files, since colNames is a logical argument, you cannot specify a vector containing the column names when you import the file. In order to use a vector for column names, use the following code:

# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define column names
mynames <- c("Location", "Temperature", "Rainfall", "Cost", "Visited", "Notes")
# Change column names for the data frame
colnames(holiday) <- mynames

In this code, after importing the Excel file and creating the column names, you use the colnames() function to change the column names of the data frame object (enclosed in brackets of the function), by the specified vector (which is after the assignment operator).

rowNames
This is a logical argument to determine whether or not to use the first column of the Excel file as row names in the data frame object. The default value is FALSE

In the example above, the data frame holiday would have been created using default row names (numbers), so to use the names from the first column of the Excel file, use the following code:

holiday <- read.xlsx("holiday.xlsx", rowNames=TRUE)

Like with column names, if you want to use a vector to specify row names, you have to do this after you import the file:

# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define row names
myrownames <- c("Row 1", "Row 2", "Row 3")
# Change row names for the data frame
rownames(holiday) <- myrownames

Additional arguments

The above code is often all you'll need to import Excel files into R. But, there are additional arguments available which you may need to use.

na.strings
This argument allows you to specify the value or character which denotes a blank field, and which will be converted to NA values on import.

The default is "" (i.e. blank field). This argument works in the same way as it does for read.csv()
rows
This argument allows you to specify which rows to import from the Excel file. By default it will import all rows.

When specifying the number of rows to import, it is important to take account of the header row. This is technically row 1, and you need to import this row as well as the data rows you want. For example:

# Import just the first row of data 
holiday <- read.xlsx("holiday.xlsx", rows=1:2)

# Import the first and third data rows
holiday <- read.xlsx("holiday.xlsx", rows=c(1:2, 4))

In the first example, to import only the first row of data, you actually have to import the first two rows (the header row and first row of data). In the second example, you import the first two rows (to get the first row of data), and then import the fourth row.

Not accounting for the header row can cause issues. Compare the following data frames which illustrate this issue:

# Correct method to import first row
holiday <- read.xlsx("holiday.xlsx", rows=1:2)
      place temp rainfall   cost visited       notes
1 Barcelona   23     13.1 Medium    TRUE Nice beaches
# Import first row of data - but it will become the column names
holiday <- read.xlsx("holiday.xlsx", rows=2)
[1] Barcelona   23          13.1        Medium      TRUE        Nice.beaches
<0 rows> (or 0-length row.names)
# Import first row of data, and specify colNames=FALSE
holiday <- read.xlsx("holiday.xlsx", rows=2, colNames=FALSE)
         X1 X2   X3     X4   X5          X6
1 Barcelona 23 13.1 Medium TRUE Nice beaches

As you can see the first example produces the correct result, with the data frame containing only the first row of data and retaining the column names. In the second example, the data becomes the column names. In the final example, we get the correct result (the data frame only contains the first row of data), but it does not have the column names.

cols
This argument allows you to specify which columns to import. This works in the same way as the rows argument. By default it will import all columns.
skipEmptyRows
This argument determines whether to skip empty rows. The default is TRUE. If this argument is set to FALSE, empty rows will be converted to NA values.
skipEmptyCols
This works in the same way as above, except for columns.
check.names
This is a logical argument as to whether R should check that the column names are "syntactically valid" and rename them where they are not.

This function works in the same way as it does for read.csv(), except the default value is FALSE instead of TRUE (i.e. it does not check the names). Although, it is usually a good idea to use syntactically valid names.

Data classes

As with the read.csv() function, data classes are automatically assigned to data imported from Excel files. Although, the classes assigned may differ - character strings are not converted to factors for example.

These are the classes assigned to the data imported from Excel:

      place        temp    rainfall        cost     visited       notes 
"character"   "numeric"   "numeric" "character"   "logical" "character" 

You cannot change the classes when importing the Excel file. You would need to do this after. For example:

# Change "cost" column to factor
holiday$cost <- factor(holiday$cost)

To understand how this works, lets highlight the relevant parts of the code:

# Data frame object
holiday$cost <- factor(holiday$cost)
# The column that you want to change
holiday$cost <- factor(holiday$cost)
# The class you want to change it to
holiday$cost <- factor(holiday$cost)

Now if you were to check the data classes, the "cost" column would now be a factor class. For further details about these functions, check the help pages within R.

You should now be familiar with importing data into R - check back to the first part of this guide to try some of the examples with your own data, or check out part 3 for an overview of R graphics and plotting data.

Thanks for reading this guide and please leave any comments below.

This is a 3 part guide:

Part 1: Introduction to R An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data.
Part 2: Importing data into R This guide shows you how to import your existing data into R from .csv or Excel files.
Part 3: An overview of graphics in R This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots.
© Benjamin Bell. All Rights Reserved. http://www.benjaminbell.co.uk

Further reading

A quick guide to pch symbols - A quick guide to the different pch symbols which are available in R, and how to use them. [R Graphics]

A quick guide to line types (lty) - A quick guide to the different line types available in R, and how to use them. [R Graphics]

Extracting CRU climate data - A 4 part guide series which shows you how to download climate data and analyse it with R.

Pollen diagrams using rioja - Part 1 of a 3 part guide series where I show you how to plot pollen diagrams using rioja.

Principal components analysis (PCA) in R - A guide showing you how to perform PCA in R, and how to create great looking biplots.

No comments

Post a Comment

Comments are moderated. There may be a delay until your comment appears.