Text manipulation

Many real-life databases were not created for scientific or analytical purposes, so their contents are often dirty or messy.

This page is especially useful if you need to extract or work with string/text variables.

df$concatVar <- paste(df$var1, df$var2, sep="")

I haven't tested it, but I suppose the same code can concatenate two (or more) variables from different data frames, as long as they have the same number of rows (df3$concatVar <- paste(df1$var1, df2$var2, sep="")).
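As a minimal sketch of the concatenation above, the example below uses made-up data frames (df1, df2, df3 and their values are assumptions for illustration only). It pastes columns taken from two different data frames, which works here because both have the same number of rows, in the same order, as the target data frame.

```r
# Hypothetical data for illustration only
df1 <- data.frame(var1 = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(var2 = c("1", "2", "3"), stringsAsFactors = FALSE)
df3 <- data.frame(id = 1:3)

# Concatenate element by element, with no separator
df3$concatVar <- paste(df1$var1, df2$var2, sep = "")

# paste0() is shorthand for paste(..., sep = "")
df3$concatVar2 <- paste0(df1$var1, df2$var2)

print(df3)
```

If the data frames do not share the same row order, the values pasted together will not belong to the same observation, so merging on a key first is the safer route.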

Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1.

df$dummy <- as.numeric(2)
df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(0)
df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(1)

The first line fills the dummy variable with 2 in order to catch potential errors: if dummy still contains 2 in some rows after running this chunk of code, neither string was found in those rows, so something probably went wrong somewhere.

grepl tests whether a specific string occurs in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html). fixed = TRUE ensures that this string is matched as literal text, not as a regular expression.
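The sketch below runs the same three steps on a small made-up data frame (the strings "Paris" and "Lyon" and the sample values are assumptions for illustration) and then lists the rows that still hold the sentinel value 2. The last two lines illustrate what fixed = TRUE changes.

```r
# Hypothetical data for illustration only
df <- data.frame(varToProcess = c("flat in Paris", "house in Lyon", "no city given"),
                 stringsAsFactors = FALSE)

df$dummy <- 2                                                  # sentinel value
df$dummy[grepl("Paris", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("Lyon", df$varToProcess, fixed = TRUE)] <- 1

# Rows still holding the sentinel matched neither string
unmatched <- df[df$dummy == 2, ]
print(unmatched)

# With fixed = TRUE, "." is a literal dot; without it, "." means "any character"
literal_match <- grepl("a.b", c("a.b", "axb"), fixed = TRUE)   # TRUE FALSE
regex_match   <- grepl("a.b", c("a.b", "axb"))                 # TRUE TRUE
```

Note that if a row contains both strings, the later assignment wins, so the order of the two grepl lines matters.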

df$var <- gsub("string to delete", "", df$var)

http://www.endmemo.com/program/R/gsub.php

To ignore case, add ignore.case = TRUE as a parameter in gsub.
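A short sketch of the difference, using a made-up character vector (the values are assumptions for illustration):

```r
# Hypothetical data for illustration only
x <- c("Price: 10 EUR", "price: 20 EUR")

# Case-sensitive by default: only the second element is changed
case_sensitive <- gsub("price: ", "", x)

# With ignore.case = TRUE, both spellings are removed
case_insensitive <- gsub("price: ", "", x, ignore.case = TRUE)

print(case_sensitive)
print(case_insensitive)
```

ignore.case only applies to regular-expression matching; combined with fixed = TRUE, R ignores it and issues a warning.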

Sometimes, real-life databases are messy but still somewhat structured.

df$newvar <- str_extract(df$var, ".+?(?=-)")

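This code extracts into newvar all the text in var that comes before the first -. .+?(?=-) is a regular expression: .+? lazily matches one or more characters, and the lookahead (?=-) stops the match just before a - without including it.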

Another example:

df$newprice <- str_extract(df$price, ".+?(?=€)")

This code extracts the text that precedes the euro symbol in the variable price and writes it to the variable newprice.
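The two extractions above can be tried on a small made-up data frame (the values are assumptions for illustration). The sketch assumes the stringr package is installed; str_extract returns NA for rows where the pattern does not match.

```r
library(stringr)

# Hypothetical data for illustration only
df <- data.frame(var   = c("abc-def-ghi", "no dash here"),
                 price = c("12.50€", "8€"),
                 stringsAsFactors = FALSE)

# Text before the first "-"; NA when there is no "-"
df$newvar <- str_extract(df$var, ".+?(?=-)")

# Text before the euro symbol
df$newprice <- str_extract(df$price, ".+?(?=€)")

# The extracted prices are still character strings; convert if numbers are needed
df$newprice_num <- as.numeric(df$newprice)

print(df)
```

Checking for NA in the new variable afterwards is a quick way to find rows that did not follow the expected structure.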

Requires the stringr package.

df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
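As a rough illustration of why fixed() is used here, the sketch below replaces a dot in a made-up vector (the values are assumptions for illustration). It assumes the stringr package is installed. Without fixed(), stringr treats the pattern as a regular expression, where "." matches any character.

```r
library(stringr)

# Hypothetical data for illustration only
x <- c("v1.0", "v1x0")

# Literal match: only real dots are replaced
literal <- str_replace_all(x, fixed("."), "_")

# Regular-expression match: every character is replaced
as_regex <- str_replace_all(x, ".", "_")

print(literal)
print(as_regex)
```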
  • Last modified: 7 weeks ago
  • by Olivier Simard-Casanova