What is the problem?
Let’s assume you have two databases, df1 and df2, that you want to match into df_matched. But you do not have a common key between the two databases to link observations that exist in both.
R can provide a solution to this issue, with an implementation that is fault tolerant.
Create a common key
If there is no common key between the two databases, the first step is to create one.
The goal is to create a common key in the two databases. The key for each observation has to be similar between databases (to match the same observation in both databases) but unique within each database (to avoid matching a given observation in df1 with several observations contained in df2, or conversely). The way of doing it is to create a new variable key from data contained in both datasets.
For instance, if you want to match databases where observations are people, this key could be the concatenation of the first name and the last name. If the number of people is quite large, you may want to add another piece of information, like the date of birth, to avoid key duplicates: the larger the number of observations in your database, the higher the likelihood that two people share the same first name and last name. You want to be sure that each key is unique enough (“unique within database”), especially since we will use a fault-tolerant matching algorithm (see section 2 of the post).
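As an illustration (the data and column names below are my own hypothetical example, not taken from any real database), a key combining names and birthdate could be built like this:

```r
# Hypothetical example data (column names are assumptions)
df1 <- data.frame(first_name = c("Olivier", "Olivier"),
                  last_name  = c("Simard-Casanova", "Dupont"),
                  birthdate  = c("1990-01-01", "1985-06-15"))

# Concatenate first name, last name, and birthdate into a single key
df1$key <- paste(df1$first_name, df1$last_name, df1$birthdate, sep = "")
```

Here the two people share a first name, so the birthdate is what keeps the two keys distinct.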
The keys also need to have the same structure in both databases (“similar between databases”). For instance, it could be the family name first, the first name second, and the birthdate last in both df1 and df2. Here is a simple example with the first name and the last name, with no character between the two concatenated variables:
df1$key <- paste(df1$first_name, df1$last_name, sep = "")
df2$key <- paste(df2$first_name, df2$last_name, sep = "")
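One quick sanity check (my own addition, not part of the original workflow) is to verify that the keys are indeed unique within each database before matching; a sketch with a toy df1:

```r
# Toy data to illustrate the check (hypothetical names)
df1 <- data.frame(first_name = c("Anna", "John"),
                  last_name  = c("Smith", "Doe"))
df1$key <- paste(df1$first_name, df1$last_name, sep = "")

# TRUE would mean at least one key is duplicated within df1
any(duplicated(df1$key))
```

If this returns TRUE, add more information (such as a birthdate) to the key before going further.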
To reduce the risk of missing observations that are in fact the same, it is better to have the same formatting among all the keys. I usually go with lowercase, no spaces, and no hyphens (look at my own name: it should be written Simard-Casanova, but sometimes it is written Simard Casanova; not taking care of this difference means I would have two different keys, although I am the same person in both databases). In df1, unifying the formatting is done with this code:
df1$key <- str_replace_all(df1$key, fixed(" "), "")
df1$key <- str_replace_all(df1$key, fixed("-"), "")
df1$key <- tolower(df1$key)
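If you prefer not to depend on stringr, here is a sketch of the same normalization in base R, wrapped in a small helper function (the function name is my own):

```r
# Normalize a key: drop spaces and hyphens, then lowercase (base R only)
normalize_key <- function(key) {
  key <- gsub(" ", "", key, fixed = TRUE)  # remove spaces
  key <- gsub("-", "", key, fixed = TRUE)  # remove hyphens
  tolower(key)
}

normalize_key("Simard-Casanova")  # "simardcasanova"
```

Wrapping the cleaning steps in a function also avoids copy-pasting the same three lines for df1 and df2.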
str_replace_all requires the stringr package.
That being said, even though we were careful while creating the keys, we cannot be 100% sure that an observation present in both databases has the same key in df1 and df2. Typos, misspellings, character encoding errors, etc. are a real thing!
The problem is you may not match two observations that are actually the same. Fuzzy matching helps to increase the number of matches by introducing fault tolerance.
(Fuzzy) Match the two databases
To be completely transparent, I am by no means a specialist in fuzzy matching. I just know that it worked for me, so you should be really careful with the code that follows.
Basically, what fuzzy matching does is simple: it merges two databases based on a given variable (here, the keys we created earlier), and the comparison of the variables is to a certain extent fault tolerant.
To match two observations, the keys need to be close enough rather than perfectly identical. It means that, on top of exactly identical keys, similar but not identical keys will also lead to the merger of the corresponding observations in the final, matched database.
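To give an intuition of what “close enough” means, here is a sketch using base R’s adist, which computes a generalized Levenshtein (edit) distance between strings; stringdist_inner_join relies on a similar notion of string distance:

```r
# Edit distance between key candidates (base R's adist)
adist("simardcasanova", "simardcasanova")  # 0: identical keys
adist("simardcasanova", "simardcasanov")   # 1: one missing letter
adist("simardcasanova", "annasmith")       # much larger: no match
```

With a maximum distance of 1, the first two keys would be matched despite the typo, while the third would not.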
The code is as follows:
df_matched <- stringdist_inner_join(df1, df2, by = "key", max_dist = 1)
Pretty simple, isn’t it?
stringdist_inner_join requires the fuzzyjoin package.
To be clear, df_matched contains only the observations with a somewhat similar key in both df1 and df2. All the observations that exist in only one of the databases are dropped.
max_dist = 1 is a conservative choice, in the sense that only observations with similar enough keys are matched.
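To see the fault tolerance in action, here is a self-contained toy example (the data is my own hypothetical illustration): the one-letter typo in df2’s key is absorbed by max_dist = 1, while the unrelated observations are dropped:

```r
library("fuzzyjoin")

# Toy databases: same person in both, but with a typo in df2's key
df1 <- data.frame(key = c("simardcasanova", "annasmith"),
                  city = c("Paris", "Lyon"))
df2 <- data.frame(key = c("simardcasanov", "johndoe"),
                  age = c(30, 40))

# Only the close-enough pair of keys survives the inner join
df_matched <- stringdist_inner_join(df1, df2, by = "key", max_dist = 1)
```

df_matched keeps both versions of the key (as key.x and key.y), which is handy for auditing which fuzzy matches were actually made.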
With this pretty simple code, you can now match two (or more!) databases that do not have a common identifier, with some fault tolerance!
Feel free to share improvements and ideas in the comments!
Here is the whole code in one single chunk:
# Load the required packages
library("stringr")   # for "str_replace_all"
library("fuzzyjoin") # for "stringdist_inner_join"

# Create the key in both databases
df1$key <- paste(df1$first_name, df1$last_name, sep = "")
df2$key <- paste(df2$first_name, df2$last_name, sep = "")

# Unify the formatting of the keys in the first database
df1$key <- str_replace_all(df1$key, fixed(" "), "")
df1$key <- str_replace_all(df1$key, fixed("-"), "")
df1$key <- tolower(df1$key)

# Unify the formatting of the keys in the second database
df2$key <- str_replace_all(df2$key, fixed(" "), "")
df2$key <- str_replace_all(df2$key, fixed("-"), "")
df2$key <- tolower(df2$key)

# Merge the databases
df_matched <- stringdist_inner_join(df1, df2, by = "key", max_dist = 1)