Many real-life databases were not created for scientific or analytical purposes, so the data they contain can be dirty or messy.
This page is especially useful if you need to extract or work with string/text variables.
Concatenate two string variables
df$concatVar <- paste(df$var1, df$var2, sep="")
I haven't tested it, but the same approach should also concatenate two or more variables stored in different data frames, as long as they have the same number of rows:
df3$concatVar <- paste(df1$var1, df2$var2, sep="")
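The cross-data-frame case mentioned above can be checked with a minimal sketch. The data frames df1, df2, df3 and their contents are made up for illustration; paste0 is base R shorthand for paste with sep = "".

```r
# Hypothetical data frames, made up for illustration
df1 <- data.frame(var1 = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(var2 = c("x", "y", "z"), stringsAsFactors = FALSE)

# Target data frame with the same number of rows as the inputs
df3 <- data.frame(id = 1:3)

# paste() works element by element, so the vectors should have equal length
df3$concatVar <- paste(df1$var1, df2$var2, sep = "")

# paste0() accepts more than two pieces, including constant separators
df3$concatVar3 <- paste0(df1$var1, "-", df2$var2)

print(df3)
```

If the vectors differ in length, paste recycles the shorter one, which is rarely what you want when combining columns.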
Create a dummy variable based on text
Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1.
df$dummy <- as.numeric(2)
df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(0)
df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(1)
The first line populates the dummy variable with 2 in order to capture potential errors: if dummy still contains 2 after running this chunk of code, something probably went wrong somewhere.
grepl is used to match a specific string in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html). fixed = TRUE ensures that this string is evaluated as literal text, not as a regular expression.
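To see the sentinel value at work, here is a small sketch on made-up data. The search strings "card" and "cash" and the sample rows are assumptions for illustration only.

```r
# Hypothetical sample data
df <- data.frame(
  varToProcess = c("paid by card", "paid in cash", "unknown"),
  stringsAsFactors = FALSE
)

# Start with the sentinel value 2, then overwrite the rows that match
df$dummy <- 2
df$dummy[grepl("card", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("cash", df$varToProcess, fixed = TRUE)] <- 1

# Count each value: any remaining 2 marks a row that matched neither string
print(table(df$dummy))

# Inspect the unmatched rows directly
unmatched <- df[df$dummy == 2, ]
print(unmatched)
```

Note that if a row contains both strings, the later assignment wins, so the order of the two grepl lines matters.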
Delete everything before a string
df$var <- gsub(".*STRING", "", df$var)
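STRING can be a single character or several. Note that the pattern removes STRING itself as well, and because .* is greedy, everything up to the last occurrence of STRING is deleted.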
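The pattern ".*STRING" is greedy, so it matters whether STRING occurs more than once. The sketch below, on made-up values with ":" standing in for STRING, compares the original pattern with two variants that are suggestions rather than part of the original recipe: a lazy match that stops at the first occurrence, and a lookahead that keeps STRING in the result.

```r
# Hypothetical values; ":" plays the role of STRING
x <- c("id:123:abc", "no separator")

# Greedy: removes everything up to and including the LAST ":"
greedy <- gsub(".*:", "", x)

# Lazy: removes everything up to and including the FIRST ":"
# sub() is used so that only the first match is replaced
lazy <- sub(".*?:", "", x, perl = TRUE)

# Lookahead: removes what precedes the last ":" but keeps the ":" itself
keep <- sub(".*(?=:)", "", x, perl = TRUE)

print(greedy)
print(lazy)
print(keep)
```

Values without the separator are returned unchanged in all three cases.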
Delete string in a variable
df$var <- gsub("string to delete", "", df$var)
To ignore case, add ignore.case = TRUE as a parameter in gsub.
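A short sketch of both options on made-up values. Keep in mind that gsub treats its pattern as a regular expression by default, so a string containing characters such as ( or . may need fixed = TRUE; the sample strings are assumptions for illustration.

```r
# Hypothetical values with inconsistent capitalisation
x <- c("Price: 10 EUR", "price: 12 eur", "PRICE: 15 Eur")

# Case-sensitive by default: only the second element changes
case_sensitive <- gsub("price: ", "", x)

# ignore.case = TRUE removes the string whatever its capitalisation
cleaned <- gsub("price: ", "", x, ignore.case = TRUE)

# Parentheses are regex metacharacters: without fixed = TRUE only "draft" goes
as_regex <- gsub("(draft)", "", "report (draft)")
literal  <- gsub("(draft)", "", "report (draft)", fixed = TRUE)

print(cleaned)
print(as_regex)
print(literal)
```

ignore.case and fixed = TRUE cannot be combined in gsub: with fixed = TRUE, R ignores ignore.case and issues a warning.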
Extract strings before a certain character occurs
Sometimes, real life databases are messy but still kind of structured.
df$newvar <- str_extract(df$var, ".+?(?=-)")
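This code extracts into newvar all the text in var that precedes the first -. The pattern .+?(?=-) is a regular expression: .+? lazily matches one or more characters, and (?=-) is a lookahead that requires a - right after the match without including it. str_extract comes from the stringr package.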
Another example:
df$newprice <- str_extract(df$price, ".+?(?=€)")
This code extracts the text that precedes the euro symbol in the variable price and writes it to the variable newprice.
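Both extraction patterns can be tried on a small made-up data frame. The sketch assumes the stringr package is installed; the sample values are hypothetical.

```r
library(stringr)

# Hypothetical sample data
df <- data.frame(
  var   = c("AB-123-x", "CD-456", "no dash"),
  price = c("12.50€", "7€ per unit", "free"),
  stringsAsFactors = FALSE
)

# Text before the first "-"; rows without a "-" become NA
df$newvar <- str_extract(df$var, ".+?(?=-)")

# Text before the euro symbol; rows without one become NA
df$newprice <- str_extract(df$price, ".+?(?=€)")

# The extracted prices are still strings; convert them if numbers are needed
df$newprice_num <- as.numeric(df$newprice)

print(df)
```

The NA values are useful here: they flag the rows where the expected separator was missing.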
Extract the first x characters of a string
To extract the first two characters of the variable var, use:
substr(var, start = 1, stop = 2)
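Applied to a data frame column, the same call might look like the sketch below. The column names first2 and firstx, the variable x_chars and the sample values are made up for illustration.

```r
# Hypothetical sample data
df <- data.frame(var = c("FR75001", "BE1000", "X"), stringsAsFactors = FALSE)

# First two characters of each value
df$first2 <- substr(df$var, start = 1, stop = 2)

# Same idea with the number of characters held in a variable
x_chars <- 3
df$firstx <- substr(df$var, start = 1, stop = x_chars)

print(df)
```

Strings shorter than the requested length are returned whole rather than padded.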
Replace a string with another one
Requires the stringr package.
df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
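The fixed() wrapper matters when the string to replace contains characters that have a special meaning in regular expressions. A minimal sketch, assuming stringr is installed and using made-up values, replaces a decimal point with a comma:

```r
library(stringr)

# Hypothetical sample data
df <- data.frame(col = c("1.5", "2.75", "10"), stringsAsFactors = FALSE)

# Without fixed(), "." is a regex that matches ANY character
as_regex <- str_replace_all(df$col, ".", ",")

# With fixed(), only literal dots are replaced
df$col <- str_replace_all(df$col, fixed("."), ",")

print(as_regex)
print(df$col)
```

Use str_replace instead of str_replace_all to replace only the first occurrence in each value.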