In the second part of this expanded guide series introducing you to R, this guide shows you how to import data into R from csv files and Excel spreadsheets.
Of course, R can import many other data types, but these two are very common and will get you started!
Guide Information
Title | Getting started with R: Importing data into R |
Author | Benjamin Bell |
Published | August 23, 2021 |
Last updated | |
R version | 4.1.1 |
Packages | base; openxlsx |
Navigation |
This is a 3 part guide:
Part 1: Introduction to R | An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data. |
Part 2: Importing data into R | This guide shows you how to import your existing data into R from .csv or Excel files. |
Part 3: An overview of graphics in R | This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots. |
R and data formats
If you followed part 1 of this guide, you would have created objects with new data. But, it is likely you have existing data that you would like to import into R. Luckily, R can read almost any kind of data format, either via base functionality or through additional packages.
These data formats may include plain text files, fixed-width files, comma-separated value (csv) files, spreadsheets (e.g. Excel), other statistical package file formats (e.g. Minitab, SPSS), database files (e.g. SQL), netCDF and HDF5 files, image files, pdf, HTML and XML files.
A complete guide for importing and exporting data for loads of file formats is available on the official website. Even if you are using an "obscure" data format, chances are that a package exists which will allow you to import the data into R.
Since trying to cover every data format would be near impossible, this guide covers two main data formats: The "universal" comma-separated values (csv) file format - showing you how to save your data to csv and import it into R; and Excel files - showing you how to import them into R.
Comma-separated values (csv) files
A comma-separated values (csv) file stores tabular data in plain text, which are usually separated by commas (In Windows, a csv file would have the .csv extension). Consider a spreadsheet, each cell stores a value or observation, in a csv file, it is these that are separated by commas. The files may contain any kind of data (numeric, character, logical).
Although there is no official "standard" for csv files, the format is widely supported across different operating systems and different software, including R.
Therefore, the easiest way to get your data into R, is to first export it to a csv file.
Exporting your data to csv
You'll only be able to save the active worksheet to a csv file. If your Excel workbook has multiple worksheets, you'll need to save these separately.
Importing a csv file into R
There are several options for importing the data contained within a csv or delimited file into R, for example: read.table()
, read.csv()
, or read.delim()
functions can be used. See ?read.table
for full details.
The default arguments applied to these functions varies depending on the function used, otherwise they can all be used for importing data.
For this guide, we'll focus on using read.csv()
and read.csv2()
. IMPORTANT! You should use read.csv()
if your csv file is delimited by comma's (,) and the decimal points are periods (.). If your csv file is delimited by semi-colon's (;) and decimal points are comma's, you should instead use read.csv2()
to import your file.
For these examples, we'll import a csv file named "holiday.csv", which contains weather data for July for three cities and qualitative cost information.
place | temp | precipitation | cost |
---|---|---|---|
Barcelona | 23 | 13.1 | Medium |
Manchester | 20 | 56.8 | Low |
New York | 25 | 104.6 | High |
To import this into R, you could use the following code:
# Import csv file into R
holiday <- read.csv("holiday.csv")
This code will import the data contained in the csv file named "holiday.csv" and create a new data frame object named holiday
. The csv file should be in the same working directory as your R project, otherwise you would also need to specify the complete file path.
Not included in this code is the default arguments, as these will apply automatically each time you use read.csv()
, unless you specify them differently. These arguments include:
header
TRUE
sep
The default is
","
for read.csv()
and ";"
for read.csv2()
dec
The default is
"."
for read.csv()
and ","
for read.csv2()
fill
TRUE
The csv file that you import may already contain row names and column names, or you may wish to create these names when you import the data, so you could use the following arguments:
row.names
col.names
Consider the following examples:
# Import csv file into R
holiday <- read.csv("holiday.csv", row.names=1)
This code would tell R that the first column of the csv file should become the row names.
Alternatively, you could specify the name of the column to use as row names:
# Import csv file into R
holiday <- read.csv("holiday.csv", row.names="place")
Both examples would create the same data frame since the first column is named "place".
You can also create a vector of names, and use this as either the row names or column names.
Here's an example to specify column names using the col.names
argument. Important! If your csv file already contains column names, they will be overwritten by the names specified in the vector. If your csv file does not contain column names you should add header=FALSE
to your code:
# Create column names
mynames <- c("Location", "Temperature (C)", "Rainfall (mm)", "Cost")
# Import data
holiday2 <- read.csv("holiday.csv", col.names=mynames)
In this example, the column names from the csv file are replaced by those in the mynames
vector. This example code does not specify the row names, so R will instead use default values (numbered rows).
Lets compare the two data frames:
> holiday
temp rainfall cost
Barcelona 23 13.1 Medium
Manchester 20 56.8 Low
New York 25 104.6 High
> holiday2
Location Temperature..C. Rainfall..mm. Cost
1 Barcelona 23 13.1 Medium
2 Manchester 20 56.8 Low
3 New York 25 104.6 High
In the first data frame, the place names are used as the row names, while in the second data, the place names are their own column named "Location", while the row names are now default values.
You'll notice that in the second data frame the column names have changed to the ones specified in the mynames
vector. However, there is a problem as the brackets (and space between the brackets) have been replaced by a period. This is because R performs a check on the names to make sure they are "syntactically valid" and will rename them if necessary. You can override this check by adding the argument check.names=FALSE
. Although, it is usually a good idea to use syntactically valid names.
Additional arguments for importing a csv
The above example code is often all you'll need to use when importing data into R. Of course, this depends on the data you are importing, and so you may need to consider some additional arguments:
na.strings
NA
values on import.
The default is
""
(i.e. blank field)
For example, if your csv file used a different value to specify a blank field, such as -9999
, you would use na.strings=-9999
. You can also specify multiple values to represent blank fields:
data <- read.csv("filename.csv", na.strings=c(-9999, "", "na"))
nrows
If you only wanted to import a limited number of rows, e.g. the first 100 rows, add
nrows=100
to your code.skip
If you wanted to skip the first 100 rows, add
skip=100
to your code.You could also combine these arguments to skip a number of rows, and then only import a specified number of rows (after the skipped rows). Consider the following example using the "holiday.csv" file:
holiday.s <- read.csv("holiday.csv", col.names=mynames, row.names=1, skip=1, nrows=1)
The resulting data frame would now only include the "Manchester" row. This is because we have told R to skip the first row, which was "Barcelona". Then we told R to only import the first row (after the skipped row), which would now be "Manchester". Consequently the last row of the csv file ("New York") is not imported.
Technically, the csv file contains 4 rows, but the first row is the header, and when header=TRUE
, this row is automatically skipped. Important! You should always specify the column names when skipping rows.
blank.lines.skip
TRUE
Data classes
R will assign a "class" to the columns based on the data they contain when the csv file is imported. By default, character data will be converted to "factors", and numbers will be converted to numerical data classes.
Consider the first "holiday.csv" example, the "temp" and "rainfall" columns contain discrete numeric data. The numeric data in the "temp" column are whole numbers, so it is converted to the class "integer", while "rainfall" is classed as "numeric" since it contains numbers with decimal places (this may also be referred to as "double" class). You don't really need to worry about numeric data classes, R will automatically convert between them.
The "cost" column contains categorical data, which R has assigned as a "factor". To check the classes of all the columns in the holiday
data frame, use the following code:
sapply(holiday, class)
Which will result in the following output:
temp rainfall cost
"integer" "numeric" "factor"
For the next example, two additional columns have been added to the csv file:
place | temp | precipitation | cost | visited | notes |
---|---|---|---|---|---|
Barcelona | 23 | 13.1 | Medium | TRUE | Nice beaches |
Manchester | 20 | 56.8 | Low | TRUE | Rainy! |
New York | 25 | 104.6 | High | FALSE | Fun |
If you now import this csv file into R and check the classes that have been assigned, you'll get the following output:
temp rainfall cost visited notes
"integer" "numeric" "factor" "logical" "factor"
Since the "visited" column contains logical data, it has been assigned the logical class. The "notes" column has been assigned as a factor, however, this column contains character data that is not categorical.
You could add the argument stringsAsFactors=FALSE
to your code to stop character data becoming factor classes. However, this would change all the columns with character data, and we want to keep the "cost" column as a factor.
Another way to override the default class assignments, is to specify them when you import the data using the colClasses
argument. For example:
# Define classes (including for the row names)
class <- c("character","integer","numeric","factor","logical","character")
# Import data
holiday3 <- read.csv("holiday.csv", row.names=1, colClasses=class)
Now the classes will be assigned exactly how we want. If you wanted to assign the columns as the same class or different classes, you could of course do this by adjusting the code.
When specifying classes for your data, you should also specify a class for the column containing the row names otherwise you'll get errors.
Generally, you shouldn't need to worry about classes when importing data, but if importing very large datasets with many data types, it may become more important.
Excel files
You can import Excel files directly into R using a package. Several packages are available which can read (and write) Excel files. This guide will show you how to import Excel files using the "openxlsx" package, since it is fast and easy to use.
To download and open the package, use the following code in R (you only have to download the package once).
# Download package (do this only once)
install.packages("openxlsx")
# Run the package (do this each time you start a new R session)
library("openxlsx")
Once the package is installed and loaded, you can use the read.xlsx()
function to import Excel files into R. For these examples, we'll use an Excel file named "holiday.xlsx" which contains the same data as the csv file from the previous example.
To import the Excel file, use the following code:
holiday <- read.xlsx("holiday.xlsx")
This code will create a new data frame object holiday
from the Excel file named "holiday.xlsx". The Excel file should be in the same working directory as your R project, otherwise you will need to specify the complete file path.
There are lots of arguments (options) available when using this function, including the following default arguments:
sheet
1
which means it will import the first worksheet, unless you specify otherwise.startRow
1
which means it will import data from the first row. The function will always skip blank rows at the top of an Excel spreadsheet.colNames
TRUE
When you import a csv file, you are able to specify a vector for column names. However, for Excel files, since colNames
is a logical argument, you cannot specify a vector containing the column names when you import the file. In order to use a vector for column names, use the following code:
# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define column names
mynames <- c("Location", "Temperature", "Rainfall", "Cost", "Visited", "Notes")
# Change column names for the data frame
colnames(holiday) <- mynames
In this code, after importing the Excel file and creating the column names, you use the colnames()
function to change the column names of the data frame object (enclosed in brackets of the function), by the specified vector (which is after the assignment operator).
rowNames
FALSE
In the example above, the data frame holiday
would have been created using default row names (numbers), so to use the names from the first column of the Excel file, use the following code:
holiday <- read.xlsx("holiday.xlsx", rowNames=TRUE)
Like with column names, if you want to use a vector to specify row names, you have to do this after you import the file:
# First import the Excel file
holiday <- read.xlsx("holiday.xlsx")
# Define row names
myrownames <- c("Row 1", "Row 2", "Row 3")
# Change row names for the data frame
rownames(holiday) <- myrownames
Additional arguments
The above code is often all you'll need to import Excel files into R. But, there are additional arguments available which you may need to use.
na.strings
NA
values on import.
The default is
""
(i.e. blank field). This argument works in the same way as it does for read.csv()
rows
When specifying the number of rows to import, it is important to take account of the header row. This is technically row 1, and you need to import this row as well as the data rows you want. For example:
# Import just the first row of data
holiday <- read.xlsx("holiday.xlsx", rows=1:2)
# Import the first and third data rows
holiday <- read.xlsx("holiday.xlsx", rows=c(1:2, 4))
In the first example, to import only the first row of data, you actually have to import the first two rows (the header row and first row of data). In the second example, you import the first two rows (to get the first row of data), and then import the fourth row.
Not accounting for the header row can cause issues. Compare the following data frames which illustrate this issue:
# Correct method to import first row
holiday <- read.xlsx("holiday.xlsx", rows=1:2)
place temp rainfall cost visited notes
1 Barcelona 23 13.1 Medium TRUE Nice beaches
# Import first row of data - but it will become the column names
holiday <- read.xlsx("holiday.xlsx", rows=2)
[1] Barcelona 23 13.1 Medium TRUE Nice.beaches
<0 rows> (or 0-length row.names)
# Import first row of data, and specify colNames=FALSE
holiday <- read.xlsx("holiday.xlsx", rows=2, colNames=FALSE)
X1 X2 X3 X4 X5 X6
1 Barcelona 23 13.1 Medium TRUE Nice beaches
As you can see the first example produces the correct result, with the data frame containing only the first row of data and retaining the column names. In the second example, the data becomes the column names. In the final example, we get the correct result (the data frame only contains the first row of data), but it does not have the column names.
cols
rows
argument. By default it will import all columns.skipEmptyRows
TRUE
. If this argument is set to FALSE
, empty rows will be converted to NA
values.skipEmptyCols
check.names
This function works in the same way as it does for
read.csv()
, except the default value is FALSE
instead of TRUE
(i.e. it does not check the names). Although, it is usually a good idea to use syntactically valid names.Data classes
As with the read.csv()
function, data classes are automatically assigned to data imported from Excel files. Although, the classes assigned may differ - character strings are not converted to factors for example.
These are the classes assigned to the data imported from Excel:
place temp rainfall cost visited notes
"character" "numeric" "numeric" "character" "logical" "character"
You cannot change the classes when importing the Excel file. You would need to do this after. For example:
# Change "cost" column to factor
holiday$cost <- factor(holiday$cost)
To understand how this works, lets highlight the relevant parts of the code:
# Data frame object
holiday$cost <- factor(holiday$cost)
# The column that you want to change
holiday$cost <- factor(holiday$cost)
# The class you want to change it to
holiday$cost <- factor(holiday$cost)
Now if you were to check the data classes, the "cost" column would now be a factor class. For further details about these functions, check the help pages within R.
You should now be familiar with importing data into R - check back to the first part of this guide to try some of the examples with your own data, or check out part 3 for an overview of R graphics and plotting data.
Thanks for reading this guide and please leave any comments below.
This is a 3 part guide:
Part 1: Introduction to R | An introduction to R - what is R, where to get R, how to get started using R, a look at objects and functions, an overview of data structures, and getting started with manipulating data. |
Part 2: Importing data into R | This guide shows you how to import your existing data into R from .csv or Excel files. |
Part 3: An overview of graphics in R | This guide gives you an overview of some of the graphical capabilities of base R to produce high quality plots. |
Further reading
A quick guide to pch symbols - A quick guide to the different pch symbols which are available in R, and how to use them. [R Graphics]
A quick guide to line types (lty) - A quick guide to the different line types available in R, and how to use them. [R Graphics]
Extracting CRU climate data - A 4 part guide series which shows you how to download climate data and analyse it with R.
Pollen diagrams using rioja - Part 1 of a 3 part guide series where I show you how to plot pollen diagrams using rioja.
Principal components analysis (PCA) in R - A guide showing you how to perform PCA in R, and how to create great looking biplots.
No comments
Post a Comment
Comments are moderated. There may be a delay until your comment appears.