Matching

How to match two databases according to two different keys (that should be true at the same time)? For instance, first name + last name.

  • Create a meta-id made of the merge of the two actual keys, and merge the databases according to this meta-id

To add more rows to a given data frame, use rbind (source. Caveat: the two data frames have to have the same columns (same number, same names). If it's not the case, delete the supplemental columns or create new columns filled with NA.

new_df <- rbdind(df1, df2)

Even though I don't see how it could be useful, cbind allows to add columns to a data frame (you want to use proper matching instead, but at least mentioning cbind explains why rbind's name starts with a “r”: it's for row bind).

How to match two databases while allowing for some errors?

Thanks to the fuzzyjoin package (GitHub), it is easy to fuzzy match stuff.

df_joined <- stringdist_inner_join(df1, df2, by = c("id"), max_dist = "1")

Only keep, in df1, all the records that are not found in df2:

stringdist_anti_join(df1, df2, by = c("id"), max_dist = "1")

For simple merging needs, use merge: manual.

Merge two similar but not identically structured data frames

Assume you have two similar data frames, but some rows only exist in one of the two data frames1). But you still want to merge them. This code does the job:

merged_df <- merge(df1, df_1, by = "var", all = TRUE)

The key part here is all = TRUE: without it, merge will delete rows that are found in just one of the two data frames.

This code will produce a merged_df with NA. You probably want to get rid of them (see how)

Let's say you want to match two databases but only for the observations that are similar according to a set of criteria (same name, same birthday, same address, etc.).

left_join from the dplyr package can help (source):

left_join(long_dataframe_name, short_dataframe_name, by= c("Age", "Gender"))

I wonder if something similar can be performed using the fuzzyjoin package (see upward).


1)
Example: you have two sets of customers, the customers of 2017 and the customers of 2018. The variable of interest is their home country. Some countries may only have customers in 2017 or only in 2018. What you want is a consolidated data frame with all the countries, to compare 2017 and 2018. But you want to include the countries where customers are only present in one of the two years.
Enter your comment. Wiki syntax is allowed:
If you can't read the letters on the image, download this .wav file to get them read to you.
 
  • Last modified: 6 weeks ago
  • by Olivier Simard-Casanova