Differences

This shows you the differences between two versions of the page.

economics:r:text-manipulation [2018/10/24 17:01]
Olivier Simard-Casanova ↷ Page name changed from r:manipulate-text to r:text-manipulation
— (current)
Line 1: Line 1:
-# Manipulate text 
  
-Many real-life databases were not created for scientific or analytical purposes, so they are often dirty or messy.
- 
-This page is especially useful if you need to extract or work with string/text variables. 
- 
-## Concatenate two string variables 
- 
-<code:r>
-df$concatVar <- paste(df$var1, df$var2, sep = "")
-</code>
- 
-I haven't tested it, but I suppose this code can also concatenate two (or more) variables stored in different data frames, provided the columns have the same length (`df3$concatVar <- paste(df1$var1, df2$var2, sep = "")`).
- 
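The claim above can be checked with a small sketch. The data frames `df1`, `df2` and `df3` and their contents are made up for illustration; the only assumption is that all three have the same number of rows.

```r
# Hypothetical toy data frames (not from the original page), all with 3 rows
df1 <- data.frame(var1 = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(var2 = c("x", "y", "z"), stringsAsFactors = FALSE)
df3 <- data.frame(id = 1:3)

# paste() works element by element, whichever data frame the vectors come from
df3$concatVar <- paste(df1$var1, df2$var2, sep = "")

# More than two variables can be passed; paste0() is shorthand for sep = ""
df3$concatAll <- paste0(df1$var1, df2$var2, df3$id)

print(df3)
```

If the columns had different lengths, `paste` would recycle the shorter one, which is rarely what you want here, so checking `nrow` on each data frame first is a reasonable precaution.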
-## Create a dummy variable based on text 
- 
-Assume you have a string variable, and depending on the presence (or not) of some text, you want to create a new binary variable taking the value 0 or 1. 
- 
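-<code:r>
-# Sentinel value: rows that match neither string keep the value 2
-df$dummy <- as.numeric(2)
-# fixed = TRUE matches the strings literally, not as regular expressions
-df$dummy[grepl("a specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(0)
-df$dummy[grepl("another specific string", df$varToProcess, fixed = TRUE)] <- as.numeric(1)
-</code>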
- 
-The first line populates the dummy variable with `2` in order to catch potential errors: if `dummy` still contains `2` after running this chunk of code, the corresponding rows matched neither string and something probably went wrong somewhere.
- 
-`grepl` is used to match a specific string in the variable to process (https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html). `fixed = TRUE` ensures that this string is matched literally, as plain text, and not interpreted as a regular expression.
- 
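To make the role of the sentinel value and of `fixed = TRUE` more concrete, here is a minimal sketch. The column contents and the search strings `(EUR)` and `(USD)` are invented for illustration and are not from the original page.

```r
# Hypothetical data (not from the original page)
df <- data.frame(
  varToProcess = c("price (EUR)", "price (USD)", "no currency"),
  stringsAsFactors = FALSE
)

# Start from the sentinel value 2, then overwrite the rows that match
df$dummy <- 2
df$dummy[grepl("(EUR)", df$varToProcess, fixed = TRUE)] <- 0
df$dummy[grepl("(USD)", df$varToProcess, fixed = TRUE)] <- 1

# Rows still equal to 2 matched neither string and deserve a closer look
unmatched <- df$varToProcess[df$dummy == 2]
print(unmatched)

# Without fixed = TRUE the parentheses become a regex group, so the pattern
# "(EUR)" means plain "EUR" and also matches a string such as "EUROPE"
regex_hit   <- grepl("(EUR)", "EUROPE")
literal_hit <- grepl("(EUR)", "EUROPE", fixed = TRUE)
```

Here `regex_hit` is `TRUE` while `literal_hit` is `FALSE`, which is the kind of silent mismatch `fixed = TRUE` is meant to avoid.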
-## Delete string in a variable 
- 
-<code:R>
-df$var <- gsub("string to delete", "", df$var)
-</code>
-
-http://www.endmemo.com/program/R/gsub.php
- 
-To ignore case, pass `ignore.case = TRUE` as an additional argument to `gsub`.
- 
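A short sketch of both variants, using invented values. One caveat worth knowing: `gsub` treats its pattern as a regular expression by default, and `ignore.case = TRUE` is ignored (with a warning) when combined with `fixed = TRUE`, so special characters have to be escaped in the case-insensitive version.

```r
# Hypothetical messy values (not from the original page)
x <- c("Paris (FR)", "LYON (fr)", "Nice")

# Case-sensitive, literal match: only the exact spelling " (FR)" is removed
literal <- gsub(" (FR)", "", x, fixed = TRUE)

# Case-insensitive: the pattern is a regex here, so the parentheses are escaped
cleaned <- gsub(" \\(fr\\)", "", x, ignore.case = TRUE)

print(literal)
print(cleaned)
```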
-## Extract strings before a certain character occurs 
- 
-Sometimes real-life databases are messy but still somewhat structured. The examples in this section use `str_extract` from the `stringr` package.
- 
-<code:R>
-df$newvar <- str_extract(df$var, ".+?(?=-)")
-</code>
- 
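-This code extracts into `newvar` all the text in `var` that comes before the first `-`. `.+?(?=-)` is a regular expression: `.+?` lazily matches one or more characters, and the lookahead `(?=-)` stops the match just before the hyphen without including it.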
- 
-Another example: 
- 
-<code:R>
-df$newprice <- str_extract(df$price, ".+?(?=€)")
-</code>
- 
-This code extracts the text that comes before the euro symbol in the variable `price` and writes it to the variable `newprice`.
- 
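If `stringr` is not available, a rough base R alternative is to delete everything from the delimiter onwards with `sub`. This is a swapped-in technique, not the `str_extract` approach above, and the values below are invented. It behaves the same way for values like these; edge cases such as a string starting with the delimiter may differ.

```r
# Hypothetical values (not from the original page)
var <- c("AB-123-x", "CD-9", "nohyphen")

# sub() removes everything from the first hyphen to the end of the string
newvar <- sub("-.*$", "", var)

# Unlike str_extract(), sub() returns the input unchanged when nothing
# matches, so rows without a hyphen are set to NA explicitly
newvar[!grepl("-", var, fixed = TRUE)] <- NA

print(newvar)
```

The same pattern works for the price example by replacing the hyphen with the euro symbol.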
-## Replace a string with another one 
- 
-Requires the `stringr` package.
- 
-<code:r>
-df$col <- str_replace_all(df$col, fixed("string to replace"), "replacement string")
-</code>
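For comparison, a base R sketch of the same literal replacement, assuming `gsub(..., fixed = TRUE)` is an acceptable stand-in for `str_replace_all` with `fixed()`. The numbers are invented; the point is to show what goes wrong when the string to replace is read as a regular expression.

```r
# Hypothetical values (not from the original page)
x <- c("1.000.000", "2.500")

# Literal replacement: only the dots are replaced
with_commas <- gsub(".", ",", x, fixed = TRUE)

# Regex replacement: "." matches any character, so everything is replaced
all_replaced <- gsub(".", ",", x)

print(with_commas)
print(all_replaced)
```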
  • Last modified: 11 months ago
  • by Olivier Simard-Casanova