In the second part of my guide for principal components analysis (PCA) in R, I additionally cover loadings plots, adding convex hulls to your biplots, more customisation options, and show you some more examples of PCA biplots created using R's base functionality...
Guide Information
Title | Principal Components Analysis (PCA) in R - Part 2 |
Author | Benjamin Bell |
Published | March 06, 2018 |
Last updated | |
R version | 3.4.2 |
Packages | base |
Navigation |
This is a 2 part guide:
Part 1: Principal components analysis (PCA) in R | PCA in R using base functions, and creating beautiful looking biplots. Also covers plotting 95% confidence ellipses. |
Part 2: Principal components analysis (PCA) in R | PCA in R, looking at loadings plots, convex hulls, specifying/limiting labels and/or variable arrows, and more biplot customisations. |
Principal Components Analysis (PCA) in R - Part 2
Part 1 of this guide showed you how to do principal components analysis (PCA) in R, using the prcomp()
function, and how to create a beautiful looking biplot using R's base functionality. If you missed the first part of this guide, check it out here.
The second part of this guide covers loadings plots and adding convex hulls to the biplot, as well as showing some additional customisation options for the PCA biplot. Ideally, you should have read part 1 to follow this guide, or you should already be familiar with the prcomp()
function and the various options for plot()
. This guide will also make use of the dataset used in part 1. To download the dataset, check out the instructions in part 1 of this guide.
Loadings plots
You may want to plot the variable loadings from your PCA as a bar chart to help determine which variables are having the most influence on the principal component. If using the prcomp()
function to carry out the PCA, the loadings are contained in the PCA object as a matrix: $rotation
.
Using the pollen dataset from part 1, we'll firstly recreate the PCA. (See part 1 for a more detailed explanation of this code).
# Import data
ma.pollen.raw <- read.csv("Morocco Pollen Surface Sample Data.csv", header=TRUE, row.names=1, sep=",", check.names=FALSE)
# Tidy/subset imported data
ma.pollen <- ma.pollen.raw[-1:-4] # Remove the first four columns
ma.pollen <- ma.pollen[-49:-51] # Remove columns
# Remove columns where total abundanace is less than 10%
ma.sum <- colSums(ma.pollen) # Calculate column sums
ma.pollen1 <- ma.pollen[, which(ma.sum > 10)] # Subset data
# PCA using base function - prcomp()
p <- prcomp(ma.pollen1, scale=TRUE)
s <- summary(p) # Summary
So in this example, the loadings for the variables for our first principal component (PC) are p$rotation[,1]
, while the second principal component loadings are p$rotation[,2]
and so on.
To plot the variable loadings for PC 1, using default plot options, you could use the following code:
# Loadings plot
barplot(p$rotation[,1], main="PC 1 Loadings Plot", las=2)
Which will result in a simple bar plot:
From this simple plot it is quite easy to see which variables are having a positive and negative influence on the principal component. But, there are several ways to improve the appearance of this plot. However, If you're happy with the default plot, feel free to skip the next part and jump down to adding convex hulls to a PCA biplot.
Firstly, lets deal with the axis labels. The variable names are quite long, so they do not fit on the figure when using the default options. The easiest way to fix this is to either make the bottom margin bigger e.g. par(mar=c(8, 3, 3, 1))
or make the names smaller e.g. cex.names=0.6
, or a combination of both.
Another way is to plot the variable names directly underneath the bars in the bar chart:
n.pc1 <- ifelse(p$rotation[,1] > 0, yes=-0.01, no=p$rotation[,1]-0.01)
In this code, we create a numeric object ("n.pc1") containing values which will position text underneath the bars in the bar chart using the ifelse()
function. The first argument p$rotation[,1] > 0
sets the condition or test for ifelse()
to meet. The yes
argument will return the value specified if the condition is met, while the no
argument will return the value specified if it fails to meet the condition.
So, in this example we are telling R, if the value of the cell in the p$rotation
matrix is greater than 0, it should return a value of "-0.01", otherwise it should return the value of the cell, minus 0.01. Type "n.pc1" into the R console to see the results:
> n.pc1
Cedrus Cupressaceae Olea
-0.2861539 -0.0100000 -0.0100000
Phillyrea Pinus Quercus deciduous
-0.1590976 -0.0100000 -0.3279861
... omitted ...
We can also change the colour of the bars automatically to reflect whether they are having a positive or negative influence.
c.pc1 <- ifelse(p$rotation[,1] > 0, yes="green2", no="red2")
This code works in much the same way as the previous example, except it replaces values with a colour instead of number.
Now, lets re-plot the loading scores using the following code:
par(mar=c(8,3,2,1)) # Set margins
b1 <- barplot(p$rotation[,1], main="PC 1 Loadings Plot", col=c.pc1, las=2, axisnames=FALSE)
abline(h=0) # Add horizontal line
text(x=b1, y=n.pc1, labels=names(p$rotation[,1]), adj=1, srt=90, xpd=TRUE) # Add variable names
In this code, since we are going to add the variable names manually, we use axisnames=FALSE
to stop barplot()
plotting them automatically. We also create the bar chart as an object so that we can extract the "midpoints" of each bar to correctly position the variable names when using the text()
function.
adj=1
sets the alignment of the variable names to right align.
srt=90
changes the direction of the text to a 90 degree angle (vertical).
xpd=TRUE
tells R that it can plot the text outside the plot region, and within the figure region. For more on "regions" and graphic devices, check out my guide to layout().
The resulting plot should look a lot better!
You can plot the second principal component in much the same way as the first. The following code will plot them both together:
# Change colour of bar plot
c.pc1 <- ifelse(p$rotation[,1] > 0, yes="green2", no="red2")
c.pc2 <- ifelse(p$rotation[,2] > 0, "green2", "red2")
# Get position for variable names
n.pc1 <- ifelse(p$rotation[,1] > 0, -0.01, p$rotation[,1]-0.01)
n.pc2 <- ifelse(p$rotation[,2] > 0, -0.01, p$rotation[,2]-0.01)
# Plot
layout(matrix(1:2, ncol=1)) # Set up layout
par(mar=c(1,3,2,1), oma=c(7.5,0,0,0)) # Set up margins
# Plot PC 1
b1 <- barplot(p$rotation[,1], main="PC 1 Loadings Plot", col=c.pc1, las=2, axisnames=FALSE)
abline(h=0)
# Plot PC 2
b2 <- barplot(p$rotation[,2], main="PC 2 Loadings Plot", col=c.pc2, las=2, axisnames=FALSE)
abline(h=0)
# Add variable names
text(x=b2, y=ifelse(p$rotation[,2] > 0, -0.01, p$rotation[,2]-0.01), labels=names(p$rotation[,2]), adj=1, srt=90, xpd=NA)
There are a couple of differences in this code to the previous example. First, in addition to setting the plot margins, we also set the "outer" margins par(oma=c(x, x, x, x))
for the figure. The reason for this, is to allow enough space for the variable names along the bottom axis of the second plot, while keeping the proportions of the two plots the same.
Compare the two figures below which show the plot (red), figure (blue) and device (green) regions. If we omit the oma
argument and instead define a larger bottom border for the second plot, it will appear "squashed" compared to the first plot (left figure). By using oma
, we can set a larger bottom margin while keeping the two plots the same size (right figure).
The other code difference is in the text()
function. Here, we set xpd=NA
instead of TRUE
, which will allow text to be plotted within the device region.
The resulting plot should look like this:
Convex hulls
In part 1 of the guide, we added 95% confidence ellipses to our biplot for each group. Another option is to display convex hulls around each group of variables in the biplot.
To create the convex hull, you need to get the coordinates of the outer most individuals (observations) for each group, using the chull()
function. There are three groups in the pollen dataset, so we first need to create separate matrices containing the coordinates for each group.
# Get individuals (observations), and create group matrices
tab <- matrix(c(p$x[,1], p$x[,2]), ncol=2)
tab1 <- tab[1:16,]
tab2 <- tab[17:30,]
tab3 <- tab[31:33,]
Next, we'll get the outer most individuals for each group.
# Get the outer most individuals
ch1 <- chull(tab1)
ch2 <- chull(tab2)
ch3 <- chull(tab3)
To plot the convex hull on the biplot, use polygon()
, and set the colour as col=NA
and additionally define a border colour using border
. This will stop the polygon from being "filled", and only the border will be drawn.
So, for the pollen dataset PCA, the following code will plot the biplot and add convex hulls. (A more detailed explanation of this code is available in part 1 of this guide).
# Create groups
pch.group <- c(rep(21, times=16), rep(22, times=14), rep(24, times=3))
col.group <- c(rep("skyblue2", times=16), rep("gold", times=14), rep("green2", times=3))
# Plot
par(mar=c(4.5, 4.5, 1, 1))
plot(p$x[,1], p$x[,2], xlim=c(-4.5, 4.5), ylim=c(-4.5, 4.5), xlab=paste("PCA 1 (", round(s$importance[2]*100, 1), "%)", sep = ""), ylab=paste("PCA 2 (", round(s$importance[5]*100, 1), "%)", sep = ""),
pch=pch.group, col="black", bg=col.group, cex=2.5, cex.axis=1.5, cex.lab=1.5, las=1,
panel.first= {
# Add convex hulls
polygon(tab1[ch1, ], border="darkblue", col=NA, lwd=2)
polygon(tab2[ch2, ], border="gold4", col=NA, lwd=2)
polygon(tab3[ch3, ], border="darkgreen", col=NA, lwd=2)
# Add grid lines
abline(v=0, lty=2, col="grey50")
abline(h=0, lty=2, col="grey50")
})
# Labels
text(p$x[,1], p$x[,2], labels=row.names(p$x), pos=c(1,3,4,2), font=2)
Which results in the following plot:
Showing only specific variable loadings and/or labels on the biplot
Variable loadings
In the first part of this guide, we created the biplot showing all the variable loadings as arrows. However, if you have lots of variables, this might not be practical. Or, you might simply want to show only the key variable loadings which are important for your dataset.
For example, lets create a biplot which shows only the variable loadings that have the most positive, and most negative influence on each principal component.
From the loadings plots we created earlier, it is easy to determine which variables are the most important. Another option is to use which.max()
and which.min()
. For example, the most and least influential variables on principal component 1 are:
> which.max(p$rotation[,1])
Brassicaceae undiff.
15
> which.min(p$rotation[,1])
Quercus deciduous
6
And for principal component 2:
> which.max(p$rotation[,2])
Pinus
5
> which.min(p$rotation[,2])
Phillyrea
4
R outputs both the variable name, and the row number from the $rotation
matrix. We'll use this information to build two matrices containing x and y coordinates for our variable loadings.
# Create matrix of x coordinates (PC1) and multiply by 10
l.x <- cbind(p$rotation[,1][c(15, 6, 5, 4)]) * 10
# y coordinates (PC2)
l.y <- cbind(p$rotation[,2][c(15, 6, 5, 4)]) * 10
In this code, we created two matrices of x and y coordinates for the specific variable loadings. We used the information from which.max
and which.min
to subset the data. So, p$rotation[,1][c(15, 6, 5, 4)]
will get the data contained in the 15th, 6th, 5th, and 4th cells of the 1st column in the p$rotation
matrix. We use cbind
to construct the new matrix, and we then multiplied the values by 10, so the arrows scale with our biplot.
Next, we'll add the arrows and labels to the biplot. (A more detailed explanation of this code is available in part 1 of the guide).
# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=3)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="red", pos=c(3, 1, 3, 1), offset=1, cex=1.2)
For these labels, we have also used offset=1
, which adjusts the position of the labels by "1" from the coordinates ("l.x" and "l.y"). This can be useful to leave extra space between the arrow head and the label.
This should result in the following plot:
Individuals (observations)
If you do not want to show labels for the individuals (observations), then you can simply omit the relevant line of code when creating the biplot. However, you may want to label some of the observations, and this can be done in much the same way as labelling specific loadings.
For example, the following code will label specific observations:
### Specific labels
n.x <- cbind(p$x[,1][c(16, 28, 33)]) # x coordinates
n.y <- cbind(p$x[,2][c(16, 28, 33)]) # y coordinates
# Add labels to biplot
text(n.x, n.y, labels=rownames(n.x), pos=c(4, 3, 3), offset=1.5, cex=1.5, font=2)
In addition to labelling specific observations, you might also want to "highlight" them. For example, you could change the plotted symbol or make the symbol bigger, or change the colour. The easiest way to do this, is to over-plot the points using the same coordinates you just extracted for the labels.
# Over-plot symbols
points(n.x, n.y, pch=c(21, 22, 24), bg=c("skyblue2", "gold", "green2"), col="deeppink", cex=4, lwd=3)
In this code, we have plotted specific observations over the existing plot. These keep the same background colours and symbol, however, they are larger, and have a different border colour, which helps them stand out on the plot. The resulting plot should look like this:
Thanks for reading this guide! As you can see from the examples above, there are loads of options for customising your PCA biplots using R's base functionality. Please leave any comments or questions below.
This is a 2 part guide:
Part 1: Principal components analysis (PCA) in R | PCA in R using base functions, and creating beautiful looking biplots. Also covers plotting 95% confidence ellipses. |
Part 2: Principal components analysis (PCA) in R | PCA in R, looking at loadings plots, convex hulls, specifying/limiting labels and/or variable arrows, and more biplot customisations. |
Further reading
Principal components analysis (PCA) in R - Part 1 of this guide for doing PCA in R using base functions, and creating beautiful looking biplots. Also covers plotting 95% confidence ellipses.
A quick guide to layout() in R - How to create multi-panel plots and figures using the layout() function. [R Graphics]
A quick guide to pch symbols - A quick guide to the different pch symbols which are available in R, and how to use them. [R Graphics]
A quick guide to line types (lty) - A quick guide to the different line types available in R, and how to use them. [R Graphics]
Hi Ben! Both of you posts on PCA plots were really usefull! Thank you for sharing.
ReplyDeleteHowever I have a question. How wolud you add to the plot with the polygons the tree with the relation of the species in order to convert the PCA into a phylomorphospace?
Thank you in advance!
Hi,
DeleteI do not know :) looks like you might need to download a package to get the plot you are after. Take a look here:
https://www.rdocumentation.org/packages/phytools/versions/0.7-80/topics/phylomorphospace
Best wishes,
Ben