Benjamin Bell: Blog: Getting started with R: Importing data into R

Getting started with R: Importing data into R

by Ben on Tuesday, August 24, 2021

In the second part of this expanded guide series introducing you to R, this guide shows you how to import data into R from csv files and Excel spreadsheets.

Of course, R can import many other data types, but these two are very common and will get you started!

Guide Information

Title	Getting started with R: Importing data into R
Author	Benjamin Bell
Published	August 23, 2021
Last updated
R version	4.1.1
Packages	base; openxlsx
Navigation	R and data formats Comma-separated values (csv) files Exporting your data to csv Importing a csv file into R Additional arguments Data classes Excel files Additional arguments Data classes Further reading

This is a 3 part guide:

Part 1: Introduction to R	An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data.
Part 2: Importing data into R	This guide shows you how to import your existing data into R from .csv or Excel files.
Part 3: An overview of graphics in R	This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots.

R and data formats

If you followed part 1 of this guide, you would have created objects with new data. But, it is likely you have existing data that you would like to import into R. Luckily, R can read almost any kind of data format, either via base functionality or through additional packages.

These data formats may include plain text files, fixed-width files, comma-separated value (csv) files, spreadsheets (e.g. Excel), other statistical package file formats (e.g. Minitab, SPSS), database files (e.g. SQL), netCDF and HDF5 files, image files, pdf, HTML and XML files.

A complete guide for importing and exporting data for loads of file formats is available on the official website. Even if you are using an "obscure" data format, chances are that a package exists which will allow you to import the data into R.

Since trying to cover every data format would be near impossible, this guide covers two main data formats: The "universal" comma-separated values (csv) file format - showing you how to save your data to csv and import it into R; and Excel files - showing you how to import them into R.

Comma-separated values (csv) files

A comma-separated values (csv) file stores tabular data in plain text, which are usually separated by commas (In Windows, a csv file would have the .csv extension). Consider a spreadsheet, each cell stores a value or observation, in a csv file, it is these that are separated by commas. The files may contain any kind of data (numeric, character, logical).

Although there is no official "standard" for csv files, the format is widely supported across different operating systems and different software, including R.

Therefore, the easiest way to get your data into R, is to first export it to a csv file.

Exporting your data to csv

Almost all software programs which manage tabular data will offer a way to export the data to a csv file. Usually, the csv format will be found if you use the Save as.. or Export options when you save your file. For specific instructions check the software manual.

To export an Excel spreadsheet (using Windows), choose File, then Save As, then choose a location to save the file (put the file in the same directory as your R project). The "Save As" dialog will appear. Where it says Save as type: click the dropdown menu and select CSV (comma delimited) (*.csv), choose a filename and click the Save button. You'll now have saved your data in csv format.

You'll only be able to save the active worksheet to a csv file. If your Excel workbook has multiple worksheets, you'll need to save these separately.

To export a Google Sheets spreadsheet (online), choose File, then Download as, then Comma-separated values (.csv, current sheet). The file will then be downloaded, and you should place it in your working directory for your R project.

Importing a csv file into R

There are several options for importing the data contained within a csv or delimited file into R, for example: read.table(), read.csv(), or read.delim() functions can be used. See ?read.table for full details.

The default arguments applied to these functions varies depending on the function used, otherwise they can all be used for importing data.

For this guide, we'll focus on using read.csv() and read.csv2(). IMPORTANT! You should use read.csv() if your csv file is delimited by comma's (,) and the decimal points are periods (.). If your csv file is delimited by semi-colon's (;) and decimal points are comma's, you should instead use read.csv2() to import your file.

For these examples, we'll import a csv file named "holiday.csv", which contains weather data for July for three cities and qualitative cost information.

place	temp	precipitation	cost
Barcelona	23	13.1	Medium
Manchester	20	56.8	Low
New York	25	104.6	High

To import this into R, you could use the following code:

# Import csv file into R
holiday <- read.csv("holiday.csv")

This code will import the data contained in the csv file named "holiday.csv" and create a new data frame object named holiday. The csv file should be in the same working directory as your R project, otherwise you would also need to specify the complete file path.

Not included in this code is the default arguments, as these will apply automatically each time you use read.csv(), unless you specify them differently. These arguments include:

header

This is a logical argument to specify whether the first line of the file contains the names of the variables. The default value is TRUE

sep

This argument allows you to specify the character to use as the field separator.

The default is "," for read.csv() and ";" for read.csv2()

dec

This argument allows you to specify the character to use as the decimal point.

The default is "." for read.csv() and "," for read.csv2()

fill

This is a logical argument to specify where rows have unequal lengths, whether blank fields should be created. The default is TRUE

The csv file that you import may already contain row names and column names, or you may wish to create these names when you import the data, so you could use the following arguments:

row.names

This argument allows you to specify the row names to use for the imported data. You can either specify a column containing the names (either using its name, or the column number), or you can specify a vector of names.

col.names

This argument allows you to specify the column names to use for the imported data. You can either specify a row containing the names (either using its name, or the row number), or you can specify a vector of names.

Consider the following examples:

# Import csv file into R
holiday <- read.csv("holiday.csv", row.names=1)

This code would tell R that the first column of the csv file should become the row names.

Alternatively, you could specify the name of the column to use as row names:

# Import csv file into R
holiday <- read.csv("holiday.csv", row.names="place")

Both examples would create the same data frame since the first column is named "place".

You can also create a vector of names, and use this as either the row names or column names.

Here's an example to specify column names using the col.names argument. Important! If your csv file already contains column names, they will be overwritten by the names specified in the vector. If your csv file does not contain column names you should add header=FALSE to your code:

# Create column names
mynames <- c("Location", "Temperature (C)", "Rainfall (mm)", "Cost") 
# Import data
holiday2 <- read.csv("holiday.csv", col.names=mynames)

In this example, the column names from the csv file are replaced by those in the mynames vector. This example code does not specify the row names, so R will instead use default values (numbered rows).

Lets compare the two data frames:

> holiday
           temp rainfall   cost
Barcelona    23     13.1 Medium
Manchester   20     56.8    Low
New York     25    104.6   High

> holiday2
    Location Temperature..C. Rainfall..mm.   Cost
1  Barcelona              23          13.1 Medium
2 Manchester              20          56.8    Low
3   New York              25         104.6   High

In the first data frame, the place names are used as the row names, while in the second data, the place names are their own column named "Location", while the row names are now default values.

You'll notice that in the second data frame the column names have changed to the ones specified in the mynames vector. However, there is a problem as the brackets (and space between the brackets) have been replaced by a period. This is because R performs a check on the names to make sure they are "syntactically valid" and will rename them if necessary. You can override this check by adding the argument check.names=FALSE. Although, it is usually a good idea to use syntactically valid names.

Additional arguments for importing a csv

The above example code is often all you'll need to use when importing data into R. Of course, this depends on the data you are importing, and so you may need to consider some additional arguments:

na.strings

This argument allows you to specify the value or character which denotes a blank field, and which will be converted to NA values on import.

The default is "" (i.e. blank field)

For example, if your csv file used a different value to specify a blank field, such as -9999, you would use na.strings=-9999. You can also specify multiple values to represent blank fields:

data <- read.csv("filename.csv", na.strings=c(-9999, "", "na"))

nrows

This argument allows you to specify how many rows of the csv file to import. By default, it will import all rows contained in the file.

If you only wanted to import a limited number of rows, e.g. the first 100 rows, add nrows=100 to your code.

skip

This argument allows you to specify how many rows of the csv file to skip when importing the file (i.e. these data rows will not be imported).

If you wanted to skip the first 100 rows, add skip=100 to your code.

You could also combine these arguments to skip a number of rows, and then only import a specified number of rows (after the skipped rows). Consider the following example using the "holiday.csv" file:

holiday.s <- read.csv("holiday.csv", col.names=mynames, row.names=1, skip=1, nrows=1)

The resulting data frame would now only include the "Manchester" row. This is because we have told R to skip the first row, which was "Barcelona". Then we told R to only import the first row (after the skipped row), which would now be "Manchester". Consequently the last row of the csv file ("New York") is not imported.

Technically, the csv file contains 4 rows, but the first row is the header, and when header=TRUE, this row is automatically skipped. Important! You should always specify the column names when skipping rows.

blank.lines.skip

This is a logical argument to specify whether blank rows in your csv file should be automatically skipped when imported into R. The default is TRUE

Data classes

R will assign a "class" to the columns based on the data they contain when the csv file is imported. By default, character data will be converted to "factors", and numbers will be converted to numerical data classes.

Consider the first "holiday.csv" example, the "temp" and "rainfall" columns contain discrete numeric data. The numeric data in the "temp" column are whole numbers, so it is converted to the class "integer", while "rainfall" is classed as "numeric" since it contains numbers with decimal places (this may also be referred to as "double" class). You don't really need to worry about numeric data classes, R will automatically convert between them.

The "cost" column contains categorical data, which R has assigned as a "factor". To check the classes of all the columns in the holiday data frame, use the following code:

sapply(holiday, class)

Which will result in the following output:

     temp  rainfall      cost 
"integer" "numeric"  "factor"

For the next example, two additional columns have been added to the csv file:

place	temp	precipitation	cost	visited	notes
Barcelona	23	13.1	Medium	TRUE	Nice beaches
Manchester	20	56.8	Low	TRUE	Rainy!
New York	25	104.6	High	FALSE	Fun

If you now import this csv file into R and check the classes that have been assigned, you'll get the following output:

     temp  rainfall      cost   visited     notes 
"integer" "numeric"  "factor" "logical"  "factor"

Since the "visited" column contains logical data, it has been assigned the logical class. The "notes" column has been assigned as a factor, however, this column contains character data that is not categorical.

You could add the argument stringsAsFactors=FALSE to your code to stop character data becoming factor classes. However, this would change all the columns with character data, and we want to keep the "cost" column as a factor.

Another way to override the default class assignments, is to specify them when you import the data using the colClasses argument. For example:

# Define classes (including for the row names)
class <- c("character","integer","numeric","factor","logical","character")
# Import data
holiday3 <- read.csv("holiday.csv", row.names=1, colClasses=class)

Now the classes will be assigned exactly how we want. If you wanted to assign the columns as the same class or different classes, you could of course do this by adjusting the code.

When specifying classes for your data, you should also specify a class for the column containing the row names otherwise you'll get errors.

Generally, you shouldn't need to worry about classes when importing data, but if importing very large datasets with many data types, it may become more important.

Excel files

You can import Excel files directly into R using a package. Several packages are available which can read (and write) Excel files. This guide will show you how to import Excel files using the "openxlsx" package, since it is fast and easy to use.

To download and open the package, use the following code in R (you only have to download the package once).

# Download package (do this only once)
install.packages("openxlsx")
# Run the package (do this each time you start a new R session)
library("openxlsx")

Once the package is installed and loaded, you can use the read.xlsx() function to import Excel files into R. For these examples, we'll use an Excel file named "holiday.xlsx" which contains the same data as the csv file from the previous example.

To import the Excel file, use the following code:

holiday <- read.xlsx("holiday.xlsx")

This code will create a new data frame object holiday from the Excel file named "holiday.xlsx". The Excel file should be in the same working directory as your R project, otherwise you will need to specify the complete file path.

There are lots of arguments (options) available when using this function, including the following default arguments:

sheet

This argument determines which Excel worksheet to import. You can either reference the sheet by its index (number), or using the name of the worksheet. The default value is 1 which means it will import the first worksheet, unless you specify otherwise.

startRow

This argument tells R which row in the Excel file to start importing data from. The default value is 1 which means it will import data from the first row. The function will always skip blank rows at the top of an Excel spreadsheet.

colNames

This is a logical argument to determine whether or not to use the first row of data in the Excel file for column names. The default value is TRUE

When you import a csv file, you are able to specify a vector for column names. However, for Excel files, since colNames is a logical argument, you cannot specify a vector containing the column names when you import the file. In order to use a vector for column names, use the following code:

# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define column names
mynames <- c("Location", "Temperature", "Rainfall", "Cost", "Visited", "Notes")
# Change column names for the data frame
colnames(holiday) <- mynames

In this code, after importing the Excel file and creating the column names, you use the colnames() function to change the column names of the data frame object (enclosed in brackets of the function), by the specified vector (which is after the assignment operator).

rowNames

This is a logical argument to determine whether or not to use the first column of the Excel file as row names in the data frame object. The default value is FALSE

In the example above, the data frame holiday would have been created using default row names (numbers), so to use the names from the first column of the Excel file, use the following code:

holiday <- read.xlsx("holiday.xlsx", rowNames=TRUE)

Like with column names, if you want to use a vector to specify row names, you have to do this after you import the file:

# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define row names
myrownames <- c("Row 1", "Row 2", "Row 3")
# Change row names for the data frame
rownames(holiday) <- myrownames

Additional arguments

The above code is often all you'll need to import Excel files into R. But, there are additional arguments available which you may need to use.

na.strings

This argument allows you to specify the value or character which denotes a blank field, and which will be converted to NA values on import.

The default is "" (i.e. blank field). This argument works in the same way as it does for read.csv()

rows

This argument allows you to specify which rows to import from the Excel file. By default it will import all rows.

When specifying the number of rows to import, it is important to take account of the header row. This is technically row 1, and you need to import this row as well as the data rows you want. For example:

# Import just the first row of data 
holiday <- read.xlsx("holiday.xlsx", rows=1:2)

# Import the first and third data rows
holiday <- read.xlsx("holiday.xlsx", rows=c(1:2, 4))

In the first example, to import only the first row of data, you actually have to import the first two rows (the header row and first row of data). In the second example, you import the first two rows (to get the first row of data), and then import the fourth row.

Not accounting for the header row can cause issues. Compare the following data frames which illustrate this issue:

# Correct method to import first row
holiday <- read.xlsx("holiday.xlsx", rows=1:2)

      place temp rainfall   cost visited       notes
1 Barcelona   23     13.1 Medium    TRUE Nice beaches

# Import first row of data - but it will become the column names
holiday <- read.xlsx("holiday.xlsx", rows=2)

[1] Barcelona   23          13.1        Medium      TRUE        Nice.beaches
<0 rows> (or 0-length row.names)

# Import first row of data, and specify colNames=FALSE
holiday <- read.xlsx("holiday.xlsx", rows=2, colNames=FALSE)

         X1 X2   X3     X4   X5          X6
1 Barcelona 23 13.1 Medium TRUE Nice beaches

As you can see the first example produces the correct result, with the data frame containing only the first row of data and retaining the column names. In the second example, the data becomes the column names. In the final example, we get the correct result (the data frame only contains the first row of data), but it does not have the column names.

cols

This argument allows you to specify which columns to import. This works in the same way as the rows argument. By default it will import all columns.

skipEmptyRows

This argument determines whether to skip empty rows. The default is TRUE. If this argument is set to FALSE, empty rows will be converted to NA values.

skipEmptyCols

This works in the same way as above, except for columns.

check.names

This is a logical argument as to whether R should check that the column names are "syntactically valid" and rename them where they are not.

This function works in the same way as it does for read.csv(), except the default value is FALSE instead of TRUE (i.e. it does not check the names). Although, it is usually a good idea to use syntactically valid names.

Data classes

As with the read.csv() function, data classes are automatically assigned to data imported from Excel files. Although, the classes assigned may differ - character strings are not converted to factors for example.

These are the classes assigned to the data imported from Excel:

      place        temp    rainfall        cost     visited       notes 
"character"   "numeric"   "numeric" "character"   "logical" "character"

You cannot change the classes when importing the Excel file. You would need to do this after. For example:

# Change "cost" column to factor
holiday$cost <- factor(holiday$cost)

To understand how this works, lets highlight the relevant parts of the code:

# Data frame object
holiday$cost <- factor(holiday$cost)
# The column that you want to change
holiday$cost <- factor(holiday$cost)
# The class you want to change it to
holiday$cost <- factor(holiday$cost)

Now if you were to check the data classes, the "cost" column would now be a factor class. For further details about these functions, check the help pages within R.

You should now be familiar with importing data into R - check back to the first part of this guide to try some of the examples with your own data, or check out part 3 for an overview of R graphics and plotting data.

Thanks for reading this guide and please leave any comments below.

This is a 3 part guide:

Part 1: Introduction to R	An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data.
Part 2: Importing data into R	This guide shows you how to import your existing data into R from .csv or Excel files.
Part 3: An overview of graphics in R	This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots.

Benjamin Bell: Blog

Pages (Desktop)

Pages (Mobile)

Getting started with R: Importing data into R

Guide Information

R and data formats

Comma-separated values (csv) files

Exporting your data to csv

Importing a csv file into R

Additional arguments for importing a csv

Data classes

Excel files

Additional arguments

Data classes

Further reading

No comments

Post a Comment