Principal Components Analysis (PCA) in R - Part 2

In the second part of my guide for principal components analysis (PCA) in R, I additionally cover loadings plots, adding convex hulls to your biplots, more customisation options, and show you some more examples of PCA biplots created using R's base functionality...


Never used R before? Check out my getting started with R guide first.
© 2018 Benjamin Bell. All Rights Reserved. https://www.benjaminbell.co.uk

Principal Components Analysis (PCA) in R - Part 2

! This guide was written using R version 3.4.2 on Windows 10.

Part 1 of this guide showed you how to do principal components analysis (PCA) in R, using the prcomp() function, and how to create a beautiful looking biplot using R's base functionality. If you missed the first part of this guide, check it out here.

The second part of this guide covers loadings plots and adding convex hulls to the biplot, as well as showing some additional customisation options for the PCA biplot. Ideally, you should have read part 1 to follow this guide, or you should already be familiar with the prcomp() function and the various options for plot(). This guide will also make use of the dataset used in part 1. To download the dataset, check out the instructions in part 1 of this guide.

© 2018 Benjamin Bell. All Rights Reserved. https://www.benjaminbell.co.uk

Loadings plots

You may want to plot the variable loadings from your PCA as a bar chart to help determine which variables are having the most influence on the principal component. If using the prcomp() function to carry out the PCA, the loadings are contained in the PCA object as a matrix: $rotation.

Using the pollen dataset from part 1, we'll firstly recreate the PCA. (See part 1 for a more detailed explanation of this code).

# Import data
ma.pollen.raw <- read.csv("Morocco Pollen Surface Sample Data.csv", header=TRUE, row.names=1, sep=",", check.names=FALSE)

# Tidy/subset imported data
ma.pollen <- ma.pollen.raw[-1:-4] # Remove the first four columns
ma.pollen <- ma.pollen[-49:-51] # Remove columns
# Remove columns where total abundanace is less than 10%
ma.sum <- colSums(ma.pollen) # Calculate column sums
ma.pollen1 <- ma.pollen[, which(ma.sum > 10)] # Subset data

# PCA using base function - prcomp()
p <- prcomp(ma.pollen1, scale=TRUE)
s <- summary(p) # Summary

So in this example, the loadings for the variables for our first principal component (PC) are p$rotation[,1], while the second principal component loadings are p$rotation[,2] and so on.

To plot the variable loadings for PC 1, using default plot options, you could use the following code:

# Loadings plot
barplot(p$rotation[,1], main="PC 1 Loadings Plot", las=2)

Which will result in a simple bar plot:

Simple loadings barplot Simple loadings barplot https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. Cedrus Cupressaceae Olea Phillyrea Pinus Quercus deciduous Quercus evergreen Ephedra distachya Ephedra fragilis Artemisia Asteraceae aster type Asteraceae cirsium type Asteraceae lactuceae type Asteraceae undiff. Brassicaceae undiff. Caryophyllaceae herniaria type Caryophyllaceae undiff. Chenopodiaceae undiff. Fabaceae astragalus type Fabaceae genista Fabaceae lotus type Fabaceae undiff. Helianthemum Lamiaceae Plantago undiff. Poaceae undiff. Zygophyllaceae PC 1 Loadings Plot -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3

From this simple plot it is quite easy to see which variables are having a positive and negative influence on the principal component. But, there are several ways to improve the appearance of this plot. However, If you're happy with the default plot, feel free to skip the next part and jump down to adding convex hulls to a PCA biplot.

Firstly, lets deal with the axis labels. The variable names are quite long, so they do not fit on the figure when using the default options. The easiest way to fix this is to either make the bottom margin bigger e.g. par(mar=c(8, 3, 3, 1)) or make the names smaller e.g. cex.names=0.6, or a combination of both.

Another way is to plot the variable names directly underneath the bars in the bar chart:

n.pc1 <- ifelse(p$rotation[,1] > 0, yes=-0.01, no=p$rotation[,1]-0.01)

In this code, we create a numeric object ("n.pc1") containing values which will position text underneath the bars in the bar chart using the ifelse() function. The first argument p$rotation[,1] > 0 sets the condition or test for ifelse() to meet. The yes= argument will return the value specified if the condition is met, while the no= argument will return the value specified if it fails to meet the condition.

So, in this example we are telling R, if the value of the cell in the p$rotation matrix is greater than 0, it should return a value of "-0.01", otherwise it should return the value of the cell, minus 0.01. Type "n.pc1" into the R console to see the results:

> n.pc1
                        Cedrus                   Cupressaceae                           Olea 
                    -0.2861539                     -0.0100000                     -0.0100000 
                     Phillyrea                          Pinus              Quercus deciduous 
                    -0.1590976                     -0.0100000                     -0.3279861 
... omitted ...

We can also change the colour of the bars automatically to reflect whether they are having a positive or negative influence.

c.pc1 <- ifelse(p$rotation[,1] > 0, yes="green2", no="red2")

This code works in much the same way as the previous example, except it replaces values with a colour instead of number.

Now, lets re-plot the loading scores using the following code:

par(mar=c(8,3,2,1)) # Set margins
b1 <- barplot(p$rotation[,1], main="PC 1 Loadings Plot", col=c.pc1, las=2, axisnames=FALSE)
abline(h=0) # Add horizontal line
text(x=b1, y=n.pc1, labels=names(p$rotation[,1]), adj=1, srt=90, xpd=TRUE) # Add variable names

In this code, since we are going to add the variable names manually, we use axisnames=FALSE to stop barplot() plotting them automatically. We also create the bar chart as an object so that we can extract the "midpoints" of each bar to correctly position the variable names when using the text() function.

adj=1 sets the alignment of the variable names to right align.

srt=90 changes the direction of the text to a 90 degree angle (vertical).

xpd=TRUE tells R that it can plot the text outside the plot region, and within the figure region. For more on "regions" and graphic devices, check out my guide to layout().

The resulting plot should look a lot better!

Better loadings barplot Better loadings barplot, with different colour bars depending on whether negative or positive influence https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. PC 1 Loadings Plot -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 Cedrus Cupressaceae Olea Phillyrea Pinus Quercus deciduous Quercus evergreen Ephedra distachya Ephedra fragilis Artemisia Asteraceae aster type Asteraceae cirsium type Asteraceae lactuceae type Asteraceae undiff. Brassicaceae undiff. Caryophyllaceae herniaria type Caryophyllaceae undiff. Chenopodiaceae undiff. Fabaceae astragalus type Fabaceae genista Fabaceae lotus type Fabaceae undiff. Helianthemum Lamiaceae Plantago undiff. Poaceae undiff. Zygophyllaceae

You can plot the second principal component in much the same way as the first. The following code will plot them both together:

# Change colour of bar plot
c.pc1 <- ifelse(p$rotation[,1] > 0, yes="green2", no="red2")
c.pc2 <- ifelse(p$rotation[,2] > 0, "green2", "red2")
# Get position for variable names
n.pc1 <- ifelse(p$rotation[,1] > 0, -0.01, p$rotation[,1]-0.01)
n.pc2 <- ifelse(p$rotation[,2] > 0, -0.01, p$rotation[,2]-0.01)
# Plot
layout(matrix(1:2, ncol=1)) # Set up layout
par(mar=c(1,3,2,1), oma=c(7.5,0,0,0)) # Set up margins
# Plot PC 1
b1 <- barplot(p$rotation[,1], main="PC 1 Loadings Plot", col=c.pc1, las=2, axisnames=FALSE)
abline(h=0)
# Plot PC 2
b2 <- barplot(p$rotation[,2], main="PC 2 Loadings Plot", col=c.pc2, las=2, axisnames=FALSE)
abline(h=0)
# Add variable names
text(x=b2, y=ifelse(p$rotation[,2] > 0, -0.01, p$rotation[,2]-0.01), labels=names(p$rotation[,2]), adj=1, srt=90, xpd=NA)

There are a couple of differences in this code to the previous example. First, in addition to setting the plot margins, we also set the "outer" margins par(oma=c(x, x, x, x)) for the figure. The reason for this, is to allow enough space for the variable names along the bottom axis of the second plot, while keeping the proportions of the two plots the same.

Compare the two figures below which show the plot (red), figure (blue) and device (green) regions. If we omit the oma= argument and instead define a larger bottom border for the second plot, it will appear "squashed" compared to the first plot (left figure). By using oma=, we can set a larger bottom margin while keeping the two plots the same size (right figure).

https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. 1 2  vs.  https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. 1 2

The other code difference is in the text() function. Here, we set xpd=NA instead of TRUE, which will allow text to be plotted within the device region.

The resulting plot should look like this:

Loadings barplot for PC 1 and PC 2 Loadings barplot for PC 1 and PC 2 https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. PC 1 Loadings Plot -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 PC 2 Loadings Plot -0.3 -0.2 -0.1 0.0 0.1 0.2 Cedrus Cupressaceae Olea Phillyrea Pinus Quercus deciduous Quercus evergreen Ephedra distachya Ephedra fragilis Artemisia Asteraceae aster type Asteraceae cirsium type Asteraceae lactuceae type Asteraceae undiff. Brassicaceae undiff. Caryophyllaceae herniaria type Caryophyllaceae undiff. Chenopodiaceae undiff. Fabaceae astragalus type Fabaceae genista Fabaceae lotus type Fabaceae undiff. Helianthemum Lamiaceae Plantago undiff. Poaceae undiff. Zygophyllaceae
Ad

© 2018 Benjamin Bell. All Rights Reserved. https://www.benjaminbell.co.uk

Convex hulls

In part 1 of the guide, we added 95% confidence ellipses to our biplot for each group. Another option is to display convex hulls around each group of variables in the biplot.

To create the convex hull, you need to get the coordinates of the outer most individuals (observations) for each group, using the chull() function. There are three groups in the pollen dataset, so we first need to create separate matrices containing the coordinates for each group.

# Get individuals (observations), and create group matrices
tab <- matrix(c(p$x[,1], p$x[,2]), ncol=2)
tab1 <- tab[1:16,]
tab2 <- tab[17:30,]
tab3 <- tab[31:33,]

Next, we'll get the outer most individuals for each group.

# Get the outer most individuals
ch1 <- chull(tab1) 
ch2 <- chull(tab2)
ch3 <- chull(tab3)

To plot the convex hull on the biplot, use polygon(), and set the colour as col=NA and additionally define a border colour using border=. This will stop the polygon from being "filled", and only the border will be drawn.

So, for the pollen dataset PCA, the following code will plot the biplot and add convex hulls. (A more detailed explanation of this code is available in part 1 of this guide).

# Create groups
pch.group <- c(rep(21, times=16), rep(22, times=14), rep(24, times=3))
col.group <- c(rep("skyblue2", times=16), rep("gold", times=14), rep("green2", times=3))
# Plot
par(mar=c(4.5, 4.5, 1, 1))
plot(p$x[,1], p$x[,2], xlim=c(-4.5, 4.5), ylim=c(-4.5, 4.5), xlab=paste("PCA 1 (", round(s$importance[2]*100, 1), "%)", sep = ""), ylab=paste("PCA 2 (", round(s$importance[5]*100, 1), "%)", sep = ""), 
pch=pch.group, col="black", bg=col.group, cex=2.5, cex.axis=1.5, cex.lab=1.5, las=1,
panel.first= {
# Add convex hulls
polygon(tab1[ch1, ], border="darkblue", col=NA, lwd=2)
polygon(tab2[ch2, ], border="gold4", col=NA, lwd=2)
polygon(tab3[ch3, ], border="darkgreen", col=NA, lwd=2)
# Add grid lines
abline(v=0, lty=2, col="grey50") 
abline(h=0, lty=2, col="grey50")
})
# Labels
text(p$x[,1], p$x[,2], labels=row.names(p$x), pos=c(1,3,4,2), font=2)

Which results in the following plot:

PCA biplot Beautiful PCA biplot with convex hulls https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. -4 -2 0 2 4 -4 -2 0 2 4 PCA 1 (22.8%) PCA 2 (13.8%) TIS 1 TIS 2 TIS 3 TIS 4 TIS 5 TIS 6 TIS 7 TIS 8 TIS 9 TIS 10 TIS 11 TIS 12 TIS 13 TIS 14 TIS 15 TIS 16 ALI 1 ALI 2 ALI 3 ALI 4 ALI 5 ALI 6 ALI 7 ALI 8 ALI 9 ALI 10 ALI 11 ALI 12 ALI 13 ALI 14 MICH 01 MICH 02 MICH 03 https://www.benjaminbell.co.uk
© 2018 Benjamin Bell. All Rights Reserved. https://www.benjaminbell.co.uk

Showing only specific variable loadings and/or labels on the biplot

Variable loadings

In the first part of this guide, we created the biplot showing all the variable loadings as arrows. However, if you have lots of variables, this might not be practical. Or, you might simply want to show only the key variable loadings which are important for your dataset.

For example, lets create a biplot which shows only the variable loadings that have the most positive, and most negative influence on each principal component.

From the loadings plots we created earlier, it is easy to determine which variables are the most important. Another option is to use which.max() and which.min(). For example, the most and least influential variables on principal component 1 are:

> which.max(p$rotation[,1])
Brassicaceae undiff. 
                  15
> which.min(p$rotation[,1])
Quercus deciduous 
                6

And for principal component 2:

> which.max(p$rotation[,2])
Pinus 
    5
> which.min(p$rotation[,2])
Phillyrea 
        4

R outputs both the variable name, and the row number from the $rotation matrix. We'll use this information to build two matrices containing x and y coordinates for our variable loadings.

# Create matrix of x coordinates (PC1) and multiply by 10
l.x <- cbind(p$rotation[,1][c(15, 6, 5, 4)]) * 10
# y coordinates (PC2)
l.y <- cbind(p$rotation[,2][c(15, 6, 5, 4)]) * 10

In this code, we created two matrices of x and y coordinates for the specific variable loadings. We used the information from which.max and which.min to subset the data. So, p$rotation[,1][c(15, 6, 5, 4)] will get the data contained in the 15th, 6th, 5th, and 4th cells of the 1st column in the p$rotation matrix. We use cbind to construct the new matrix, and we then multiplied the values by 10, so the arrows scale with our biplot.

Next, we'll add the arrows and labels to the biplot. (A more detailed explanation of this code is available in part 1 of the guide).

# Add arrows to biplot
arrows(x0=0, x1=l.x, y0=0, y1=l.y, col="red", length=0.15, lwd=3)
# Labels
text(l.x, l.y, labels=rownames(l.x) , col="red", pos=c(3, 1, 3, 1), offset=1, cex=1.2)

For these labels, we have also used offset=1, which adjusts the position of the labels by "1" from the coordinates ("l.x" and "l.y"). This can be useful to leave extra space between the arrow head and the label.

This should result in the following plot:

PCA biplot Beautiful PCA biplot with convex hulls and select variable loadings shown. https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. -4 -2 0 2 4 -4 -2 0 2 4 PCA 1 (22.8%) PCA 2 (13.8%) TIS 1 TIS 2 TIS 3 TIS 4 TIS 5 TIS 6 TIS 7 TIS 8 TIS 9 TIS 10 TIS 11 TIS 12 TIS 13 TIS 14 TIS 15 TIS 16 ALI 1 ALI 2 ALI 3 ALI 4 ALI 5 ALI 6 ALI 7 ALI 8 ALI 9 ALI 10 ALI 11 ALI 12 ALI 13 ALI 14 MICH 01 MICH 02 MICH 03 https://www.benjaminbell.co.uk Brassicaceae undiff. Quercus deciduous Pinus Phillyrea

Individuals (observations)

If you do not want to show labels for the individuals (observations), then you can simply omit the relevant line of code when creating the biplot. However, you may want to label some of the observations, and this can be done in much the same way as labelling specific loadings.

For example, the following code will label specific observations:

### Specific labels
n.x <- cbind(p$x[,1][c(16, 28, 33)]) # x coordinates
n.y <- cbind(p$x[,2][c(16, 28, 33)]) # y coordinates
# Add labels to biplot
text(n.x, n.y, labels=rownames(n.x), pos=c(4, 3, 3), offset=1.5, cex=1.5, font=2)

In addition to labelling specific observations, you might also want to "highlight" them. For example, you could change the plotted symbol or make the symbol bigger, or change the colour. The easiest way to do this, is to over-plot the points using the same coordinates you just extracted for the labels.

# Over-plot symbols
points(n.x, n.y, pch=c(21, 22, 24), bg=c("skyblue2", "gold", "green2"), col="deeppink", cex=4, lwd=3)

In this code, we have plotted specific observations over the existing plot. These keep the same background colours and symbol, however, they are larger, and have a different border colour, which helps them stand out on the plot. The resulting plot should look like this:

PCA biplot Beautiful PCA biplot with convex hulls, and select variable loadings, and select observation labels. https://www.benjaminbell.co.uk Can't see the image? You need to upgrade your browser. -4 -2 0 2 4 -4 -2 0 2 4 PCA 1 (22.8%) PCA 2 (13.8%) TIS 16 ALI 12 MICH 03 https://www.benjaminbell.co.uk Brassicaceae undiff. Quercus deciduous Pinus Phillyrea

Thanks for reading this guide! As you can see from the examples above, there are loads of options for customising your PCA biplots using R's base functionality. Please leave any comments or questions below.


Ad

© 2018 Benjamin Bell. All Rights Reserved. https://www.benjaminbell.co.uk

Further reading

Principal components analysis (PCA) in R - Part 1 of this guide for doing PCA in R using base functions, and creating beautiful looking biplots. Also covers plotting 95% confidence ellipses.

A quick guide to layout() in R - How to create multi-panel plots and figures using the layout() function. [R Graphics]

A quick guide to pch symbols - A quick guide to the different pch symbols which are available in R, and how to use them. [R Graphics]

A quick guide to line types (lty) - A quick guide to the different line types available in R, and how to use them. [R Graphics]


No comments:

Post a Comment