
# How to match two databases without a common identifier: a fault tolerant solution with R

## What is the problem?

Let’s assume you have two databases, df1 and df2, that you want to match into df_matched. But you do not have a common key between the two databases to link observations that exist in both.

R can provide a solution to this issue, with an implementation that is fault tolerant.

## Create a common key

If there is no common key between the two databases, the first step is to create one.

The goal is to create a common key in the two databases. The key for each observation has to be similar between databases (to match the same observation in both databases) but unique within each database (to avoid matching a given observation in df1 with several observations contained in df2, or conversely). The way to do it is to create a new variable, key, from data contained in both datasets.

For instance, if you want to match databases where observations are people, this key could be the concatenation of the first name and the last name. If the number of people is quite large, you may want to add more information, like the date of birth, to avoid key duplicates, because the larger the number of observations in your database, the higher the likelihood that two people have the same first name and last name. You want to be sure that each key is unique enough (“unique within database”), especially since we will use a fault-tolerant matching algorithm (see section 2 of the post).
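To make the uniqueness problem concrete, here is a minimal sketch in base R, assuming made-up columns first_name, last_name, and birthdate:

```r
# Toy data frame; the column names and values are illustrative assumptions
df1 <- data.frame(first_name = c("John", "John"),
                  last_name  = c("Smith", "Smith"),
                  birthdate  = c("1980-01-01", "1992-07-15"))

# Two different John Smiths: the name alone would collide,
# but adding the birthdate keeps the keys unique within the database
df1$key <- paste(df1$first_name, df1$last_name, df1$birthdate, sep = "")
df1$key
# "JohnSmith1980-01-01" "JohnSmith1992-07-15"
```

Without the birthdate, both keys would be "JohnSmith" and the two people could not be told apart.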

The keys also need to have the same structure in both databases (“similar between databases”). For instance, it could be the last name first, the first name second, and the birthdate last, in both df1 and df2.

Here is a simple example with the first name and the last name in df1 and df2, with no character between the two concatenated variables:

df1$key <- paste(df1$first_name, df1$last_name, sep="")
df2$key <- paste(df2$first_name, df2$last_name, sep="")


To reduce the risk of missing observations that are in fact the same, it is better to have the same formatting among all the keys. I usually go with a lowercase key, with no spaces and no hyphens. Look at my own name: it should be written Simard-Casanova, but sometimes it is written Simard Casanova. Not taking care of this difference means I would have two different keys in df1 and df2, although I am the same person in both databases.

For df1, unifying the formatting is done with this code:

df1$key <- str_replace_all(df1$key, fixed(" "), "")
df1$key <- str_replace_all(df1$key, fixed("-"), "")
df1$key <- tolower(df1$key)


str_replace_all requires the package stringr (CRAN).
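After the formatting step, it can be worth checking the “unique within database” requirement directly. Here is a sketch using base R only; the sample keys are made up:

```r
# Made-up keys after formatting; real keys would come from df1$key
keys <- c("johnsmith", "janedoe", "johnsmith")

# Flag any duplicated keys: TRUE here means the key is not unique enough
any(duplicated(keys))    # TRUE

# Show which keys collide, so you can add more information to the key
keys[duplicated(keys)]   # "johnsmith"
```

If duplicates show up, go back and enrich the key (with the birthdate, for instance) before matching.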

That being said, even though we were careful while creating the keys, we cannot be 100% sure that an observation present in both databases has the same key in df1 and df2. Typos, misspellings, character encoding errors, etc. are a real thing!

The problem is that you may fail to match two observations that are actually the same. Fuzzy matching helps to increase the number of matches by introducing fault tolerance.

## (Fuzzy) Match the two databases

To be completely transparent, I am by no means a specialist in fuzzy matching. I just know that it worked for me, but you should be really careful with the code that follows.

Basically, what fuzzy matching does is simple: it merges two databases based on a given variable (here, the keys we created earlier), and the comparison of the variables is to a certain extent fault tolerant.

To match two observations, the keys need to be close enough instead of being perfectly identical. It means that on top of exactly identical keys, keys that are similar but not identical will also lead to the corresponding observations being merged in the final, matched database.
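As a rough illustration of what “close enough” means, base R’s adist() computes an edit distance: the number of single-character insertions, deletions, and substitutions needed to turn one string into the other. (fuzzyjoin relies on the stringdist package, whose default distance is similar.) The sample keys are made up:

```r
# One missing letter: edit distance 1, close enough with a threshold of 1
adist("johnsmith", "jonsmith")[1, 1]      # 1

# Completely different keys: far above the threshold, no match
adist("johnsmith", "janedoe")[1, 1] > 1   # TRUE
```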

The code is as follows:

df_matched <- stringdist_inner_join(df1, df2, by = c("key"), max_dist = 1)


Pretty simple, isn’t it?

stringdist_inner_join requires the package fuzzyjoin (CRAN).

To be clear, df_matched contains only the observations with a sufficiently similar key in both df1 and df2. All the observations that exist in only one of the databases are dropped. max_dist = 1 is a conservative choice, in the sense that only observations with very similar keys are matched.
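If you want to see which observations were dropped, fuzzyjoin also provides stringdist_anti_join, which keeps the rows of the first database that found no match. A sketch with made-up toy data (the typo in "jonhsmith" is deliberate):

```r
library("fuzzyjoin")  # for stringdist_inner_join and stringdist_anti_join

# Toy databases; the keys are illustrative assumptions
df1 <- data.frame(key = c("johnsmith", "janedoe"))
df2 <- data.frame(key = c("jonhsmith", "marksmith"))

matched   <- stringdist_inner_join(df1, df2, by = "key", max_dist = 1)
unmatched <- stringdist_anti_join(df1, df2, by = "key", max_dist = 1)

nrow(matched)    # 1: "johnsmith" matched "jonhsmith" despite the typo
unmatched$key    # "janedoe" had no close-enough key in df2
```

Inspecting the unmatched rows is a good way to decide whether your keys, or the max_dist threshold, need adjusting.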

With this pretty simple code, you can now match two (or more!) databases that do not have a common identifier, with some fault tolerance!

Feel free to share improvements and ideas in the comments!

## Full code

Here is the whole code in one single chunk:

# Load the required packages
library("stringr") # for "str_replace_all"
library("fuzzyjoin") # for "stringdist_inner_join"

# Create the key in both databases
df1$key <- paste(df1$first_name, df1$last_name, sep="")
df2$key <- paste(df2$first_name, df2$last_name, sep="")

# Unify the formatting of the keys in the first database
df1$key <- str_replace_all(df1$key, fixed(" "), "")
df1$key <- str_replace_all(df1$key, fixed("-"), "")
df1$key <- tolower(df1$key)

# Unify the formatting of the keys in the second database
df2$key <- str_replace_all(df2$key, fixed(" "), "")
df2$key <- str_replace_all(df2$key, fixed("-"), "")
df2$key <- tolower(df2$key)

# Merge the databases
df_matched <- stringdist_inner_join(df1, df2, by = c("key"), max_dist = 1)