Text manipulation

Many real-life databases were not created for scientific or analytical purposes. As a result, the data they contain are often dirty or messy.

This page is especially useful if you need to extract or work with string/text variables.

Concatenate two string variables

df$concatVar <- paste(df$var1, df$var2, sep="")

I haven't tested it, but I suppose the same code can concatenate two (or more?) variables stored in different data frames (df3$concatVar <- paste(df1$var1, df2$var2, sep="")), provided the variables have the same number of rows.
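A minimal sketch of that idea, using small made-up data frames (df1, df2, df3 and their columns are hypothetical). It also shows that paste accepts more than two vectors and that sep controls the separator.

```r
# Hypothetical data frames with the same number of rows
df1 <- data.frame(var1 = c("A", "B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(var2 = c("x", "y", "z"), stringsAsFactors = FALSE)
df3 <- data.frame(id = 1:3)

# Concatenate columns coming from two different data frames
df3$concatVar <- paste(df1$var1, df2$var2, sep = "")

# paste0() is a shortcut for paste(..., sep = "")
df3$concatVar0 <- paste0(df1$var1, df2$var2)

# More than two vectors, with a visible separator
df3$concatSep <- paste(df1$var1, df2$var2, df3$id, sep = "_")

print(df3)
```

If the vectors have different lengths, paste recycles the shorter one, so it is safer to check the row counts first.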

Create a dummy variable based on text

Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1.

df$dummy <- as.numeric(2)
df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(0)
df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(1)

The first line fills the dummy variable with 2 in order to catch potential errors: if dummy still contains 2 after running this chunk of code, the row matched neither string, which probably means something went wrong somewhere.

grepl is used to match a specific string in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html). fixed = TRUE ensures that this string is treated as literal text, not as a regular expression.
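Here is a small worked example of this pattern. The data and the two search strings ("cash" and "card") are made up for illustration.

```r
# Hypothetical data: a free-text column to process
df <- data.frame(
  varToProcess = c("paid by card", "paid in cash", "unknown", "card (refused)"),
  stringsAsFactors = FALSE
)

df$dummy <- 2  # sentinel value, should disappear if every row matches
df$dummy[grepl("cash", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("card", df$varToProcess, fixed = TRUE)] <- 1

# Rows still equal to 2 matched neither string and need a manual look
print(table(df$dummy))
print(df[df$dummy == 2, ])

# Why fixed = TRUE matters: "." is a wildcard in a regular expression
regex_match <- grepl("1.5", "125")                 # "." matches any character
fixed_match <- grepl("1.5", "125", fixed = TRUE)   # looks for the literal "1.5"
```

Note that the order of the two assignments matters: a row containing both strings ends up with the value assigned last.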

Delete everything before a string

df$var <- gsub(".*STRING", "", df$var)

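STRING can be a single character or a longer string. Note that STRING itself is deleted too, and that .* is greedy: if STRING occurs several times, everything up to its last occurrence is removed.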
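The sketch below compares the code above with two variants on made-up values. The lazy and lookahead versions are not from the original snippet; they are shown only as possible alternatives when the first occurrence is the one that matters or when the delimiter should be kept.

```r
x <- c("ID-2021-alpha", "no separator here")

# Greedy (as above): removes everything up to and including the LAST "-"
greedy <- gsub(".*-", "", x)

# Lazy variant: stops at the FIRST "-" (perl = TRUE for Perl-style regex)
lazy <- sub(".*?-", "", x, perl = TRUE)

# Lookahead variant: removes the text before the first "-" but keeps the "-"
keep_delim <- sub(".*?(?=-)", "", x, perl = TRUE)

print(greedy)
print(lazy)
print(keep_delim)
```

Strings that do not contain the delimiter are returned unchanged in all three cases.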

Delete string in a variable

df$var <- gsub("string to delete", "", df$var)

Source.

To ignore case, add ignore.case = TRUE as a parameter in gsub.
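A short illustration on made-up values. The last call is an addition to the original snippet: it uses fixed = TRUE, as in the dummy variable section, because characters such as ( ) and . are regular expression operators.

```r
x <- c("Price: 10 EUR", "price: 12 eur", "15 (approx.)")

# Case-sensitive by default: only the lowercase variant is removed
case_sensitive <- gsub("price: ", "", x)

# ignore.case = TRUE removes both variants
case_insensitive <- gsub("price: ", "", x, ignore.case = TRUE)

# fixed = TRUE treats the parentheses and the dot as literal characters
literal <- gsub(" (approx.)", "", x, fixed = TRUE)

print(case_sensitive)
print(case_insensitive)
print(literal)
```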

Extract strings before a certain character occurs

Sometimes real-life databases are messy but still somewhat structured.

df$newvar <- str_extract(df$var, ".+?(?=-)")

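This code, which requires the stringr package, stores in newvar all the text in var that comes before the first -. .+?(?=-) is a regular expression: .+? matches as few characters as possible, and (?=-) is a lookahead requiring that a - follows without including it in the result.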

Another example:

df$newprice <- str_extract(df$price, ".+?(?=€)")

This code will extract the text that comes before the euro symbol in the variable price and write it to the variable newprice.
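If stringr is not available, the same extraction can be sketched in base R. extract_before is a hypothetical helper, not part of any package. It relies on regexpr and regmatches with perl = TRUE, because lookaheads need the Perl-compatible regex engine, and it returns NA when the delimiter is absent.

```r
# Hypothetical helper: text before the first occurrence of `delim`, NA if absent
extract_before <- function(x, delim) {
  # \Q...\E quotes the delimiter so it is always read literally
  pattern <- paste0(".+?(?=\\Q", delim, "\\E)")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)          # regmatches returns matched elements only
  out
}

before_dash <- extract_before(c("AB-12-x", "no dash"), "-")
before_euro <- extract_before(c("12.50€", "8€ each"), "€")

print(before_dash)
print(before_euro)
```

The extracted values are still character strings, so a price would need as.numeric before any calculation.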

Extract the first x characters of a string

To extract the first two characters of the variable var, use:

substr(var, start = 1, stop = 2)
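A quick example on a made-up vector. The second part, extracting the last two characters with nchar, is an addition that is not in the original snippet.

```r
var <- c("FR75001", "BE1000", "X")

# First two characters of each element (shorter strings are returned whole)
first_two <- substr(var, start = 1, stop = 2)

# Last two characters: use nchar() to locate the end of each string
n <- nchar(var)
last_two <- substr(var, start = pmax(n - 1, 1), stop = n)

print(first_two)
print(last_two)
```

substr is vectorized, so the same call works on a data frame column, for example df$prefix <- substr(df$var, 1, 2) with hypothetical column names.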

Source.

Replace a string with another one

Requires the stringr package.

df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
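The sketch below shows on made-up values why fixed() matters here, together with what should be an equivalent base R call (gsub with fixed = TRUE). The stringr part only runs if the package is installed.

```r
x <- c("1.5 kg", "2.25 kg")

# Base R: fixed = TRUE makes "." a literal dot rather than "any character"
base_result <- gsub(".", ",", x, fixed = TRUE)

# Without fixed = TRUE, "." is a regex wildcard and every character is replaced
regex_result <- gsub(".", ",", x)

# Same replacement with stringr, only run if the package is available
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr_result <- stringr::str_replace_all(x, stringr::fixed("."), ",")
  stopifnot(identical(stringr_result, base_result))
}

print(base_result)
print(regex_result)
```

str_replace_all replaces every occurrence in each string, whereas str_replace only replaces the first one.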