Benjamin Bell: Blog: Useful functions for manipulating text in R

Useful functions for manipulating text in R

by Ben on Thursday, March 22, 2018

R has some really cool little features to make life easier. Two really useful functions for dealing with long text or character strings are abbreviate() and strtrim(). The first automatically abbreviates character strings down to a specified minimum number of characters, and the second trims a long character string to a specified number of characters. These functions can be really useful if you need to shorten text - for example, in plot axes or legends.

This quick guide will show you how to use both of these functions in R, and also take a look at paste() for further text manipulation.

Guide Information

Title	Useful functions for manipulating text in R
Author	Benjamin Bell
Published	March 22, 2018
Last updated
R version	3.4.2
Packages	base

Abbreviate text

abbreviate() is a useful function for automatically abbreviating long text or character strings. For this guide, we'll use the built-in dataset state.name, which, as the name suggests, is a character vector containing the names of all 50 US states. Simply type "state.name" into the R console to see:

> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"

The function has a few options which allow you to specify the minimum length of the abbreviated text, whether to remove lowercase characters first, whether to append a period at the end of the text, and others. Let's do a few examples:

ab1 <- abbreviate(state.name, minlength=4, dot=TRUE)

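This code takes the object "state.name", abbreviates each character string to a minimum of 4 characters (minlength=4), and appends a period (dot=TRUE). As the comparison table further down shows, names that are already short enough, such as "Iowa", are left unchanged and do not get a period.

Let's decrease the abbreviation length to 3 letters: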

ab2 <- abbreviate(state.name, minlength=3, dot=TRUE)

And let's also add another argument:

ab3 <- abbreviate(state.name, minlength=3, dot=TRUE, strict=TRUE)

strict=TRUE will enforce the minlength argument, but may result in some duplicate abbreviations. (The default value is FALSE).

Now, let's combine all these into a matrix, so we can compare the abbreviations:

ab.m1 <- cbind(ab1, ab2, ab3)

> ab.m1
               ab1     ab2     ab3   
Alabama        "Albm." "Alb."  "Alb."
Alaska         "Alsk." "Als."  "Als."
Arizona        "Arzn." "Arz."  "Arz."
Arkansas       "Arkn." "Ark."  "Ark."
California     "Clfr." "Clf."  "Clf."
Colorado       "Clrd." "Clr."  "Clr."
Connecticut    "Cnnc." "Cnn."  "Cnn."
Delaware       "Dlwr." "Dlw."  "Dlw."
Florida        "Flrd." "Flr."  "Flr."
Georgia        "Gerg." "Grg."  "Grg."
Hawaii         "Hawa." "Haw."  "Haw."
Idaho          "Idah." "Idh."  "Idh."
Illinois       "Illn." "Ill."  "Ill."
Indiana        "Indn." "Ind."  "Ind."
Iowa           "Iowa"  "Iow."  "Iow."
Kansas         "Knss." "Kns."  "Kns."
Kentucky       "Kntc." "Knt."  "Knt."
Louisiana      "Losn." "Lsn."  "Lsn."
Maine          "Main." "Man."  "Man."
Maryland       "Mryl." "Mry."  "Mry."
Massachusetts  "Mssc." "Mssc." "Mss."
Michigan       "Mchg." "Mch."  "Mch."
Minnesota      "Mnns." "Mnn."  "Mnn."
Mississippi    "Msss." "Msss." "Mss."
Missouri       "Mssr." "Mssr." "Mss."
Montana        "Mntn." "Mnt."  "Mnt."
Nebraska       "Nbrs." "Nbr."  "Nbr."
Nevada         "Nevd." "Nvd."  "Nvd."
New Hampshire  "NwHm." "NwH."  "NwH."
New Jersey     "NwJr." "NwJ."  "NwJ."
New Mexico     "NwMx." "NwM."  "NwM."
New York       "NwYr." "NwY."  "NwY."
North Carolina "NrtC." "NrC."  "NrC."
North Dakota   "NrtD." "NrD."  "NrD."
Ohio           "Ohio"  "Ohi."  "Ohi."
Oklahoma       "Oklh." "Okl."  "Okl."
Oregon         "Orgn." "Org."  "Org."
Pennsylvania   "Pnns." "Pnn."  "Pnn."
Rhode Island   "RhdI." "RhI."  "RhI."
South Carolina "SthC." "StC."  "StC."
South Dakota   "SthD." "StD."  "StD."
Tennessee      "Tnns." "Tnn."  "Tnn."
Texas          "Texs." "Txs."  "Txs."
Utah           "Utah"  "Uth."  "Uth."
Vermont        "Vrmn." "Vrm."  "Vrm."
Virginia       "Vrgn." "Vrg."  "Vrg."
Washington     "Wshn." "Wsh."  "Wsh."
West Virginia  "WstV." "WsV."  "WsV."
Wisconsin      "Wscn." "Wsc."  "Wsc."
Wyoming        "Wymn." "Wym."  "Wym."

Compare the abbreviations produced for "ab2" and "ab3". Although both specified a minimum length of 3 characters (minlength=3), "ab2" shows some abbreviations as 4 characters to avoid creating duplicates. In "ab3", where the minimum length was enforced (strict=TRUE), there are duplicate abbreviations, which is not ideal.

An easy way to find the duplicates is to use the following code:

ab3[duplicated(ab3)]

In this code, we subset the "ab3" vector to show only the values that duplicated() flags as duplicates, which results in the following output:

> ab3[duplicated(ab3)]
Mississippi    Missouri 
     "Mss."      "Mss."

Now, the eagle-eyed among you will have spotted in the large table that Massachusetts also has the abbreviation "Mss." - so why is it not included in the above? Well, since Massachusetts is the first state in our vector to use this abbreviation, it is not considered a duplicate. Only the states which come after it and have the same abbreviation are flagged as duplicates.

But, since we now know the duplicate abbreviation, there is another way to show all the states which have the same abbreviation, by using which().
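If you want every member of a duplicate group, including the first occurrence, one option is to combine duplicated() with its fromLast argument. The following is a minimal sketch on a small hand-typed vector whose values are copied from the "ab3" column in the table above; the object names abbr, later and all_dups are just for illustration.

```r
# A few values copied from the ab3 column above (hand-typed for illustration)
abbr <- c(Alabama = "Alb.", Massachusetts = "Mss.",
          Mississippi = "Mss.", Missouri = "Mss.")

# duplicated() on its own skips the first occurrence of each value
later <- abbr[duplicated(abbr)]
print(names(later))

# Scanning from both ends flags every element that shares its value
all_dups <- abbr[duplicated(abbr) | duplicated(abbr, fromLast = TRUE)]
print(names(all_dups))
```

The same expression should work on the full "ab3" vector, without needing to know in advance which abbreviation is repeated.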

ab3[which(ab3 == "Mss.")]

In this code, we subset our data to show the states whose abbreviation exactly matches "Mss.", which results in the following output:

> ab3[which(ab3 == "Mss.")]
Massachusetts   Mississippi      Missouri 
       "Mss."        "Mss."        "Mss."

Let's consider some more examples:

ab4 <- abbreviate(state.name, minlength=3, dot=TRUE, use.classes=FALSE)

use.classes is a logical argument that controls whether R should first remove lowercase characters from the string when abbreviating. (The default value is TRUE).

ab5 <- abbreviate(state.name, minlength=3, dot=TRUE, method="both.sides")

method="both.sides" changes the method that R uses when it abbreviates the character string. (The default method is method="left.kept"). More information about the methods is available in the help page, which you can open with ?abbreviate.

ab6 <- abbreviate(state.name, minlength=3, dot=TRUE, use.classes=FALSE, method="both.sides")

We'll combine these into a matrix (and also include ab2) for comparison:

ab.m2 <- cbind(ab2, ab4, ab5, ab6)

> ab.m2
               ab2     ab4      ab5    ab6   
Alabama        "Alb."  "Alab."  "Alb." "Ala."
Alaska         "Als."  "Alas."  "Als." "ska."
Arizona        "Arz."  "Ari."   "Arz." "Ari."
Arkansas       "Ark."  "Ark."   "Ark." "Ark."
California     "Clf."  "Cal."   "Clf." "Cal."
Colorado       "Clr."  "Col."   "Clr." "Col."
Connecticut    "Cnn."  "Con."   "Cnn." "Con."
Delaware       "Dlw."  "Del."   "Dlw." "Del."
Florida        "Flr."  "Flo."   "Flr." "Flo."
Georgia        "Grg."  "Geo."   "Grg." "Geo."
Hawaii         "Haw."  "Haw."   "Haw." "Haw."
Idaho          "Idh."  "Ida."   "Idh." "Ida."
Illinois       "Ill."  "Ill."   "Ill." "Ill."
Indiana        "Ind."  "Ind."   "Ind." "Ind."
Iowa           "Iow."  "Iow."   "Iow." "Iow."
Kansas         "Kns."  "Kan."   "Kns." "Kan."
Kentucky       "Knt."  "Ken."   "Knt." "Ken."
Louisiana      "Lsn."  "Lou."   "Lsn." "Lou."
Maine          "Man."  "Mai."   "Man." "Mai."
Maryland       "Mry."  "Mar."   "Mry." "Mar."
Massachusetts  "Mssc." "Mas."   "Mss." "Mas."
Michigan       "Mch."  "Mic."   "Mch." "Mic."
Minnesota      "Mnn."  "Min."   "Mnn." "Min."
Mississippi    "Msss." "Missi." "Mpi." "Mis."
Missouri       "Mssr." "Misso." "Mri." "uri."
Montana        "Mnt."  "Mon."   "Mnt." "Mon."
Nebraska       "Nbr."  "Neb."   "Nbr." "Neb."
Nevada         "Nvd."  "Nev."   "Nvd." "Nev."
New Hampshire  "NwH."  "NeH."   "NwH." "NeH."
New Jersey     "NwJ."  "NeJ."   "NwJ." "NeJ."
New Mexico     "NwM."  "NeM."   "NwM." "NeM."
New York       "NwY."  "NeY."   "NwY." "NeY."
North Carolina "NrC."  "NoC."   "NrC." "NoC."
North Dakota   "NrD."  "NoD."   "NrD." "NoD."
Ohio           "Ohi."  "Ohi."   "Ohi." "Ohi."
Oklahoma       "Okl."  "Okl."   "Okl." "Okl."
Oregon         "Org."  "Ore."   "Org." "Ore."
Pennsylvania   "Pnn."  "Pen."   "Pnn." "Pen."
Rhode Island   "RhI."  "RhI."   "RhI." "RhI."
South Carolina "StC."  "SoC."   "StC." "SoC."
South Dakota   "StD."  "SoD."   "StD." "SoD."
Tennessee      "Tnn."  "Ten."   "Tnn." "Ten."
Texas          "Txs."  "Tex."   "Txs." "Tex."
Utah           "Uth."  "Uta."   "Uth." "Uta."
Vermont        "Vrm."  "Ver."   "Vrm." "Ver."
Virginia       "Vrg."  "Vir."   "Vrg." "Vir."
Washington     "Wsh."  "Was."   "Wsh." "Was."
West Virginia  "WsV."  "WeV."   "WsV." "WeV."
Wisconsin      "Wsc."  "Wis."   "Wsc." "Wis."
Wyoming        "Wym."  "Wyo."   "Wym." "Wyo."

You'll notice that many of the abbreviations are the same, even though different methods were used. But, for some states, the abbreviations are quite different:

Missouri       "Mssr." "Misso." "Mri." "uri."

Depending on your requirements, there are plenty of options for abbreviating text in R! One final argument to consider is the named argument. This is a logical argument that controls whether the original strings (i.e. the state names) should be returned as the names of the vector of abbreviations. The default value is TRUE.
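To see the effect of the named argument, here is a minimal sketch using just two state names. The object names x, with_names and plain are only for illustration and are not used elsewhere in this guide.

```r
x <- c("Alabama", "Alaska")

# Default (named = TRUE): the original strings are attached as names
with_names <- abbreviate(x, minlength = 3)
print(names(with_names))

# named = FALSE: a plain character vector with no names attached
plain <- abbreviate(x, minlength = 3, named = FALSE)
print(names(plain))
```

The unnamed version can be handy when the names would otherwise clutter printed output, or when you pass the result to something that treats names specially.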

Trim text

Another useful function is strtrim() which will trim a character string to a specified limit. The options are more limited compared to abbreviate(). Consider the following:

t5 <- strtrim(state.name, 5)

This code will trim the character strings within the "state.name" vector to a maximum of 5 characters. Strings that are already shorter than this are left unchanged. The width can be any positive whole number, for example:

t2 <- strtrim(state.name, 2)
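The width does not have to be a single number. According to the strtrim() help page, the width values are recycled to the length of the input, so you can also supply one width per string. A minimal sketch, where x, mixed and same are illustrative names:

```r
x <- c("Alabama", "Alaska", "Arizona")

# One width per element: keep 1, 2 and 3 characters respectively
mixed <- strtrim(x, c(1, 2, 3))
print(mixed)

# A single width is applied to every element
same <- strtrim(x, 4)
print(same)
```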

And that is pretty much it for strtrim(). Unlike abbreviate(), the function does not attach the original character strings to its result as names. But you can compare the results by combining them into a matrix:

trim <- cbind(state.name, t5, t2)

> trim
      state.name       t5      t2  
 [1,] "Alabama"        "Alaba" "Al"
 [2,] "Alaska"         "Alask" "Al"
 [3,] "Arizona"        "Arizo" "Ar"
 [4,] "Arkansas"       "Arkan" "Ar"
 [5,] "California"     "Calif" "Ca"
 [6,] "Colorado"       "Color" "Co"
 [7,] "Connecticut"    "Conne" "Co"
 [8,] "Delaware"       "Delaw" "De"
 [9,] "Florida"        "Flori" "Fl"
[10,] "Georgia"        "Georg" "Ge"
[11,] "Hawaii"         "Hawai" "Ha"
[12,] "Idaho"          "Idaho" "Id"
[13,] "Illinois"       "Illin" "Il"
[14,] "Indiana"        "India" "In"
[15,] "Iowa"           "Iowa"  "Io"
[16,] "Kansas"         "Kansa" "Ka"
[17,] "Kentucky"       "Kentu" "Ke"
[18,] "Louisiana"      "Louis" "Lo"
[19,] "Maine"          "Maine" "Ma"
[20,] "Maryland"       "Maryl" "Ma"
[21,] "Massachusetts"  "Massa" "Ma"
[22,] "Michigan"       "Michi" "Mi"
[23,] "Minnesota"      "Minne" "Mi"
[24,] "Mississippi"    "Missi" "Mi"
[25,] "Missouri"       "Misso" "Mi"
[26,] "Montana"        "Monta" "Mo"
[27,] "Nebraska"       "Nebra" "Ne"
[28,] "Nevada"         "Nevad" "Ne"
[29,] "New Hampshire"  "New H" "Ne"
[30,] "New Jersey"     "New J" "Ne"
[31,] "New Mexico"     "New M" "Ne"
[32,] "New York"       "New Y" "Ne"
[33,] "North Carolina" "North" "No"
[34,] "North Dakota"   "North" "No"
[35,] "Ohio"           "Ohio"  "Oh"
[36,] "Oklahoma"       "Oklah" "Ok"
[37,] "Oregon"         "Orego" "Or"
[38,] "Pennsylvania"   "Penns" "Pe"
[39,] "Rhode Island"   "Rhode" "Rh"
[40,] "South Carolina" "South" "So"
[41,] "South Dakota"   "South" "So"
[42,] "Tennessee"      "Tenne" "Te"
[43,] "Texas"          "Texas" "Te"
[44,] "Utah"           "Utah"  "Ut"
[45,] "Vermont"        "Vermo" "Ve"
[46,] "Virginia"       "Virgi" "Vi"
[47,] "Washington"     "Washi" "Wa"
[48,] "West Virginia"  "West " "We"
[49,] "Wisconsin"      "Wisco" "Wi"
[50,] "Wyoming"        "Wyomi" "Wy"
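If you would like the trimmed strings to carry the original text as names, in the same way abbreviate() does by default, one possible approach is to attach them yourself with setNames(). This is only a sketch on three example strings; x and t5_named are illustrative names.

```r
x <- c("Alabama", "Iowa", "West Virginia")

# Trim to 5 characters, then attach the untrimmed strings as names
t5_named <- setNames(strtrim(x, 5), x)
print(t5_named)
```

Note that trimming "West Virginia" to 5 characters keeps the space, giving "West " with a trailing blank, just as in the table above.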

paste()

You can manipulate the character string further to add additional text using paste(). You saw that abbreviate() lets you add a period to the end of the abbreviation, but strtrim() does not have this option.

To add periods to the end of the trimmed text you could use the following code:

t2p <- paste(t2, ".", sep="")

In this code, you specify the vector which will appear first, in this case "t2", then specify what will appear after it, in this case a period ".". The argument sep="" tells R not to insert any space or other character between the two.

> t2p
 [1] "Al." "Al." "Ar." "Ar." "Ca." "Co." "Co." "De." "Fl." "Ge." "Ha." "Id."
[13] "Il." "In." "Io." "Ka." "Ke." "Lo." "Ma." "Ma." "Ma." "Mi." "Mi." "Mi."
[25] "Mi." "Mo." "Ne." "Ne." "Ne." "Ne." "Ne." "Ne." "No." "No." "Oh." "Ok."
[37] "Or." "Pe." "Rh." "So." "So." "Te." "Te." "Ut." "Ve." "Vi." "Wa." "We."
[49] "Wi." "Wy."

Or, you could do it the other way around:

pt2 <- paste(".", t2, sep="")

> pt2
 [1] ".Al" ".Al" ".Ar" ".Ar" ".Ca" ".Co" ".Co" ".De" ".Fl" ".Ge" ".Ha" ".Id"
[13] ".Il" ".In" ".Io" ".Ka" ".Ke" ".Lo" ".Ma" ".Ma" ".Ma" ".Mi" ".Mi" ".Mi"
[25] ".Mi" ".Mo" ".Ne" ".Ne" ".Ne" ".Ne" ".Ne" ".Ne" ".No" ".No" ".Oh" ".Ok"
[37] ".Or" ".Pe" ".Rh" ".So" ".So" ".Te" ".Te" ".Ut" ".Ve" ".Vi" ".Wa" ".We"
[49] ".Wi" ".Wy"

Let's consider a more useful example. We'll add the state abbreviations we created earlier with minlength=3 ("ab2") to our state names, and we'll enclose the abbreviations in parentheses.

s <- paste(state.name, " (", ab2, ")", sep="")

This code now includes 4 vectors: the state names, the opening parenthesis, the abbreviations and the closing parenthesis. Note that the opening parenthesis string includes a preceding space, to separate the state name from the abbreviation. If you were to put the space in the sep argument instead, this would result in a space being inserted between all the vectors, which is not what we want.

> s
 [1] "Alabama (Alb.)"        "Alaska (Als.)"         "Arizona (Arz.)"       
 [4] "Arkansas (Ark.)"       "California (Clf.)"     "Colorado (Clr.)"      
 [7] "Connecticut (Cnn.)"    "Delaware (Dlw.)"       "Florida (Flr.)"       
[10] "Georgia (Grg.)"        "Hawaii (Haw.)"         "Idaho (Idh.)"         
[13] "Illinois (Ill.)"       "Indiana (Ind.)"        "Iowa (Iow.)"          
[16] "Kansas (Kns.)"         "Kentucky (Knt.)"       "Louisiana (Lsn.)"     
[19] "Maine (Man.)"          "Maryland (Mry.)"       "Massachusetts (Mssc.)"
[22] "Michigan (Mch.)"       "Minnesota (Mnn.)"      "Mississippi (Msss.)"  
[25] "Missouri (Mssr.)"      "Montana (Mnt.)"        "Nebraska (Nbr.)"      
[28] "Nevada (Nvd.)"         "New Hampshire (NwH.)"  "New Jersey (NwJ.)"    
[31] "New Mexico (NwM.)"     "New York (NwY.)"       "North Carolina (NrC.)"
[34] "North Dakota (NrD.)"   "Ohio (Ohi.)"           "Oklahoma (Okl.)"      
[37] "Oregon (Org.)"         "Pennsylvania (Pnn.)"   "Rhode Island (RhI.)"  
[40] "South Carolina (StC.)" "South Dakota (StD.)"   "Tennessee (Tnn.)"     
[43] "Texas (Txs.)"          "Utah (Uth.)"           "Vermont (Vrm.)"       
[46] "Virginia (Vrg.)"       "Washington (Wsh.)"     "West Virginia (WsV.)" 
[49] "Wisconsin (Wsc.)"      "Wyoming (Wym.)"
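To see the difference the sep argument makes, here is a small sketch on a single state. It also shows two base R alternatives, paste0() and sprintf(), which are not used elsewhere in this guide; the object names are illustrative.

```r
name <- "Ohio"
abbr <- "Ohi."

# Space in sep: it is inserted between every pair of arguments
spaced <- paste(name, "(", abbr, ")", sep = " ")
print(spaced)

# paste0() behaves like paste(..., sep = "")
tight <- paste0(name, " (", abbr, ")")
print(tight)

# sprintf() builds the same string from a template
templ <- sprintf("%s (%s)", name, abbr)
print(templ)
```

The first result has unwanted spaces inside the parentheses, while the other two match the format used for "s" above.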

And that's it! Simple, but really useful functions for text manipulation to make life easier! Thanks for reading, please leave any comments or questions below.
