Many real-life databases were not created for scientific or analytic purposes. In other words, they can be dirty or messy.
This page is especially useful if you need to extract or work with string/text variables.
Concatenate two string variables
df$concatVar <- paste(df$var1, df$var2, sep="")
I haven't tested it, but since paste() accepts any number of vectors, this code should also concatenate two (or more) variables stored in different data frames, as long as they have the same number of rows (
df3$concatVar <- paste(df1$var1, df2$var2, sep="")).
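As a quick check on made-up data, here is a minimal sketch of how sep changes the result. The data frame and its values are purely illustrative; paste0() is the base R shorthand for sep = "".

```r
# Hypothetical example data (not from any real database)
df <- data.frame(var1 = c("A", "B"),
                 var2 = c("1", "2"),
                 stringsAsFactors = FALSE)

df$concatVar <- paste(df$var1, df$var2, sep = "")   # "A1"  "B2"
df$concatSep <- paste(df$var1, df$var2, sep = "_")  # "A_1" "B_2"
df$concat0   <- paste0(df$var1, df$var2)            # same as sep = ""
```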
Create a dummy variable based on text
Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1.
df$dummy <- as.numeric(2)
df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(0)
df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(1)
The first line populates the dummy variable with 2 in order to capture potential errors: if dummy still contains 2 after running this chunk of code, something probably went wrong somewhere (for example, a row matched neither string).
grepl is used to match a specific string in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html).
fixed = TRUE ensures that this string is treated as literal text, not as a regular expression.
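To see the sentinel value at work, here is a small sketch on invented data. The strings "urban" and "rural" and the sample rows are assumptions chosen only for illustration.

```r
# Hypothetical example data
df <- data.frame(varToProcess = c("urban area", "rural area", "unknown"),
                 stringsAsFactors = FALSE)

df$dummy <- 2  # sentinel: rows still holding 2 afterwards were never matched
df$dummy[grepl("urban", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("rural", df$varToProcess, fixed = TRUE)] <- 1

# Row numbers that matched neither string
unmatched <- which(df$dummy == 2)
```

Here unmatched contains 3, the row holding "unknown", which is exactly the kind of leftover the sentinel is meant to reveal.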
Delete everything before a string
df$var <- gsub(".*STRING", "", df$var)
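STRING can be a single character or a longer string. Note that STRING itself is deleted too, and because .* is greedy, everything up to the last occurrence of STRING is removed.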
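The difference between stopping at the last and at the first occurrence can be sketched as follows. The input string is made up, and the lazy variant with sub() and perl = TRUE is an alternative suggested here, not part of the original snippet.

```r
x <- "a-b-c"  # hypothetical input

# Greedy .* : deletes everything up to and including the LAST "-"
after_last <- gsub(".*-", "", x)                 # "c"

# Lazy .*? with sub(): deletes only up to and including the FIRST "-"
after_first <- sub(".*?-", "", x, perl = TRUE)   # "b-c"
```

sub() is used in the second call because gsub() would keep replacing and end up removing every prefix that ends in "-".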
Delete string in a variable
df$var <- gsub("string to delete", "", df$var)
To ignore case, add ignore.case = TRUE as a parameter.
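A short sketch of the effect of ignore.case, using invented sample strings:

```r
x <- c("Remove THIS word", "remove this word")  # hypothetical input

# Case-sensitive (default): only the lower-case occurrence is deleted
strict <- gsub("this ", "", x)
# "Remove THIS word" "remove word"

# Case-insensitive: both occurrences are deleted
loose <- gsub("this ", "", x, ignore.case = TRUE)
# "Remove word" "remove word"
```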
Extract strings before a certain character occurs
Sometimes real-life databases are messy but still somewhat structured, for example when a separator character consistently marks where the useful text ends.
df$newvar <- str_extract(df$var, ".+?(?=-)")
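This code extracts into newvar all the text contained in var before the first - character.
.+?(?=-) is a regular expression: .+? lazily matches one or more characters, and (?=-) is a lookahead that requires a - to follow without including it in the match. str_extract comes from the stringr package, which must be loaded first.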
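A minimal sketch on made-up values, assuming the stringr package is installed. It also shows what happens when a row contains no hyphen at all.

```r
library(stringr)  # provides str_extract()

# Hypothetical example data
df <- data.frame(var = c("AB-123-x", "CD-456", "nodash"),
                 stringsAsFactors = FALSE)

df$newvar <- str_extract(df$var, ".+?(?=-)")
# "AB" "CD" NA   (rows without "-" yield NA)
```

The NA for unmatched rows is convenient for spotting entries that do not follow the expected structure.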
df$newprice <- str_extract(df$price, ".+?(?=€)")
This code extracts the text that precedes the euro symbol in the variable price and writes it to the variable newprice.
Extract the first x characters in a string
To extract the first two characters of the variable var:
substr(var, start = 1, stop = 2)
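Applied to a column of a data frame, this could look like the sketch below. The column names and values are hypothetical.

```r
# Hypothetical example data
df <- data.frame(var = c("FR75001", "BE1000", "X"),
                 stringsAsFactors = FALSE)

df$prefix <- substr(df$var, start = 1, stop = 2)
# "FR" "BE" "X"   (strings shorter than 2 characters are returned as they are)
```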
Replace a string with another one
df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
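The fixed() wrapper matters as soon as the string to replace contains characters that have a special meaning in regular expressions. Here is a small sketch with invented values, assuming the stringr package is installed:

```r
library(stringr)  # provides str_replace_all() and fixed()

x <- c("1.5", "2.75")  # hypothetical input

# Without fixed(): "." is a regex wildcard, so every character is replaced
as_regex <- str_replace_all(x, ".", ",")
# ",,," ",,,,"

# With fixed(): "." is taken literally
as_text <- str_replace_all(x, fixed("."), ",")
# "1,5" "2,75"
```

In base R, gsub(".", ",", x, fixed = TRUE) gives the same result as the second call.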