# Text manipulation

Many real-life databases were not created for scientific or analytical purposes, so they can be dirty or messy.

This page is especially useful if you need to extract or work with string/text variables.

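To concatenate two variables into a new one:

    df$concatVar <- paste(df$var1, df$var2, sep = "")

I haven't tested it, but I suppose the same code can concatenate two (or more) variables stored in different data frames, since paste works on plain vectors:

    df3$concatVar <- paste(df1$var1, df2$var2, sep = "")

In that case the values are paired by row position, so the data frames should have the same number of rows, in the same order.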
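As a small illustration of the concatenation described above, here is a minimal sketch on toy data. The data frames df1, df2 and df3 and their values are made up for the example; only paste and paste0 come from base R.

```r
# Toy data frames (hypothetical names and values, for illustration only)
df1 <- data.frame(var1 = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(var2 = c("x", "y", "z"), stringsAsFactors = FALSE)

# Concatenate element by element, with no separator between the pieces
joined <- paste(df1$var1, df2$var2, sep = "")

# paste0() is shorthand for paste(..., sep = "")
joined0 <- paste0(df1$var1, df2$var2)

# Any separator can be used, and more than two vectors can be combined
joined_sep <- paste(df1$var1, df2$var2, df1$var1, sep = "_")

# Store the result in a third data frame that has the same number of rows
df3 <- data.frame(id = 1:3)
df3$concatVar <- joined

print(df3)
```

Values are combined by position only. If the rows of the data frames are not in the same order, it is probably safer to merge them on a key first and concatenate afterwards.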

Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1.

    df$dummy <- 2
    df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- 0
    df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- 1

The first line populates the dummy variable with 2 in order to capture potential errors: if dummy still contains 2 after running this chunk of code, the row matched neither string and something probably went wrong somewhere. grepl is used to match a specific string in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html). fixed = TRUE ensures that this string is evaluated as literal text, not as a regular expression.

To delete everything up to and including a given string:

    df$var <- gsub(".*STRING", "", df$var)

STRING can be a single character or more. Because .* is greedy, everything up to the last occurrence of STRING is removed.

To delete a specific string wherever it occurs:

    df$var <- gsub("string to delete", "", df$var)

To ignore case, add ignore.case = TRUE as a parameter in gsub.

Sometimes, real-life databases are messy but still somewhat structured. In that case, part of a string can be extracted with a regular expression:

    df$newvar <- str_extract(df$var, ".+?(?=-)")

This code extracts into newvar all the text in var that comes before the first hyphen (-). The pattern .+?(?=-) is a regular expression: a lazy match followed by a lookahead for the hyphen. Another example:

    df$newprice <- str_extract(df$price, ".+?(?=€)")

This code extracts the text that comes before the euro symbol in the variable price and writes it to the variable newprice. Both examples require the stringr package.

To replace every occurrence of a literal string with another one (also with stringr):

    df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
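The recipes above can be tried on a toy data frame. This is only a sketch: the column names code, afterDash, noPrefix and beforeDash and all values are invented for the example. To keep it runnable without extra packages, the extraction step uses base R's regexpr and regmatches with perl = TRUE in place of stringr's str_extract; the pattern is the same.

```r
# Toy data (hypothetical values, for illustration only)
df <- data.frame(
  varToProcess = c("contains alpha", "contains beta", "neither"),
  code = c("AB-123-x", "CD-456", "EF789"),
  stringsAsFactors = FALSE
)

# Dummy variable: start with the sentinel value 2, then overwrite on match
df$dummy <- 2
df$dummy[grepl("alpha", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("beta", df$varToProcess, fixed = TRUE)] <- 1
# The third row matches neither string, so it keeps the sentinel 2

# fixed = TRUE treats "." literally; as a regex, "." matches any character
literal  <- grepl("a.c", c("a.c", "abc"), fixed = TRUE)  # TRUE FALSE
as_regex <- grepl("a.c", c("a.c", "abc"))                # TRUE TRUE

# Delete everything up to and including the last "-" (".*" is greedy)
df$afterDash <- gsub(".*-", "", df$code)

# Delete a string regardless of case
df$noPrefix <- gsub("ab", "", df$code, ignore.case = TRUE)

# Extract the text before the first "-" (base R stand-in for str_extract);
# rows without a match are left as NA
m <- regexpr(".+?(?=-)", df$code, perl = TRUE)
df$beforeDash <- NA_character_
df$beforeDash[m != -1] <- regmatches(df$code, m)

print(df)
```

Note that rows without a hyphen are left unchanged by gsub but become NA in the extraction step, so the two approaches treat unstructured rows differently.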
