How to match two databases according to two different keys (that should be true at the same time)? For instance, first name + last name.
- Create a
meta-idmade of the merge of the two actual keys, and merge the databases according to this
To add more rows to a given data frame, use
rbind (source. Caveat: the two data frames have to have the same columns (same number, same names). If it’s not the case, delete the supplemental columns or create new columns filled with
new_df <- rbdind(df1, df2)
Even though I don’t see how it could be useful,
cbind allows to add columns to a data frame (you want to use proper matching instead, but at least mentioning
cbind explains why
rbind‘s name starts with a “r”: it’s for row bind).
See this blog post for more details on fuzzy matching.
How to match two databases while allowing for some errors?
Thanks to the
fuzzyjoin package (GitHub), it is easy to fuzzy match stuff.
df_joined <- stringdist_inner_join(df1, df2, by = c("id"), max_dist = "1")
Only keep, in
df1, all the records that are not found in
stringdist_anti_join(df1, df2, by = c("id"), max_dist = "1")
Merge two data frames
For simple merging needs, use
Merge two similar but not identically structured data frames
Assume you have two similar data frames, but some rows only exist in one of the two data frames1Example: you have two sets of customers, the customers of 2017 and the customers of 2018. The variable of interest is their home country. Some countries may only have customers in 2017 or only in 2018. What you want is a consolidated data frame with all the countries, to compare 2017 and 2018. But you want to include the countries where customers are only present in one of the two years.. But you still want to merge them. This code does the job:
merged_df <- merge(df1, df_1, by = "var", all = TRUE)
The key part here is
all = TRUE: without it,
merge will delete rows that are found in just one of the two data frames.
This code will produce a
NA. You probably want to get rid of them (see how)
Let’s say you want to match two databases but only for the observations that are similar according to a set of criteria (same name, same birthday, same address, etc.).
left_join from the
dplyr package can help (source):
left_join(long_dataframe_name, short_dataframe_name, by= c("Age", "Gender"))
I wonder if something similar can be performed using the
fuzzyjoin package (see upward).