Fuzzy join strings on multiple columns in one dataset

Question

I want to fuzzy match one column (df2$brands) against many other columns (df1$F6_1 through df1$F6_12) that contain the same strings with some small spelling errors.

I have two datasets:

df1:

df1 <- structure(list(F6_1 = c("Braand1", "Brand2", "Brand3", "Brand4", "Brand4", 
"Brand5", "Brand6", "Brand7", "Brand6", "Brand8"), F6_2 = c("Brand9", 
"", "Brand4", "Brando6", "Brand6", "Brand8", "Brannd4", "Brandd8", 
"Brand6", "Brand6"), F6_3 = c("Brand6", "", "Brand6", 
"Brand10", "Brand10", "", "Brand8", "Brand10", "Brand8", "Brand3"
), F6_4 = c("", "", "Brand10", "", "Brand3", "", "Brand6", "Brand6", 
"Bramd3", "BPand3"), F6_5 = c("", "", "", "", "Brand6", 
"", "Brand1", "Brand1", "", "Brand1"), F6_6 = c("", 
"", "", "", "Brand6", "", "Brand3", "", "", "Brand1"), F6_7 = c("", 
"", "", "", "Brand1", "", "Brand1", "", "", "Brand1"), F6_8 = c("", 
"", "", "", "Brand1", "", "", "", "", "Brand6"
), F6_9 = c("", "", "", "", "Brrandu3", "", "", "", "", ""), F6_10 = c("", 
"", "", "", "Brand6", "", "", "", "", ""), F6_11 = c("", 
"", "", "", "Brand6", "", "", "", "", ""), F6_12 = c("", "", 
"", "", "Brand6", "", "", "", "", "")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

df2:

df2 <- structure(list(brands = c("Brand1", "Brand2", "Brand3", "Brand4", "Brand5", 
"Brand6")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", 
"data.frame"))

I tried the stringdist_left_join() function from the fuzzyjoin package, which works perfectly for a single column.

library(tidyverse)
library(fuzzyjoin)

df1_F6_1 <- df1 %>% select(F6_1)
df2_F6_1 <- df2 %>% select(F6_1 = brands)

df_joined_F6_1 <- stringdist_left_join(df1_F6_1, df2_F6_1, by = "F6_1", method = "soundex")

This works for only one column, but I want to do it for the complete df1 dataset. I could fuzzy-join every column separately and then combine the results, but there should be an easier, more convenient way to do this.

My output should look like this:

df3 <- structure(list(F6_1 = c("Braand1", "Brand2", "Brand3", "Brand4", 
"Brand4", "Brand5", "Brand6", "Brand7", "Brand6", "Brand8"), 
    F6_1_a = c("Brand1", "Brand2", "Brand3", "Brand4", "Brand4", 
    "Brand5", "Brand6", "Brand7", "Brand6", "Brand8"), F6_2 = c("Brand9", 
    NA, "Brand4", "Brando6", "Brand6", "Brand8", "Brannd4", "Brandd8", 
    "Brand6", "Brand6"), F6_2_a = c("Brand9", NA, "Brand4", "Brand6", 
    "Brand6", "Brand8", "Brand4", "Brand8", "Brand6", "Brand6"
    ), F6_3 = c("Brand6", NA, "Brand6", "Brand10", "Brand10", 
    "Brand8", "Brand8", "Brand10", "Brand8", "Brand3"), F6_3_a = c("Brand6", 
    NA, "Brand6", "Brand10", "Brand10", "Brand8", "Brand8", "Brand10", 
    "Brand8", "Brand3"), F6_4 = c(NA, NA, "Brand10", NA, "Brand3", 
    NA, "Brand6", "Brand6", "Bramd3", "BPand3"), F6_4_a = c(NA, 
    NA, "Brand10", NA, "Brand3", NA, "Brand6", "Brand6", "Brand3", 
    "Brand3"), F6_5 = c(NA, NA, NA, NA, "Brand6", NA, "Brand1", 
    "Brand1", NA, "Brand1"), F6_5_a = c(NA, NA, NA, NA, "Brand6", 
    NA, "Brand1", "Brand1", NA, "Brand1"), F6_6 = c(NA, NA, NA, 
    NA, "Brand6", NA, "Brand3", NA, NA, "Brand1"), F6_6_a = c(NA, 
    NA, NA, NA, "Brand6", NA, "Brand3", NA, NA, "Brand1"), F6_7 = c(NA, 
    NA, NA, NA, "Brand1", NA, "Brand1", NA, NA, "Brand1"), F6_7_a = c(NA, 
    NA, NA, NA, "Brand1", NA, "Brand1", NA, NA, "Brand1"), F6_8 = c(NA, 
    NA, NA, NA, "Brand1", NA, NA, NA, NA, "Brand6"), F6_8_a = c(NA, 
    NA, NA, NA, "Brand1", NA, NA, NA, NA, NA), F6_9 = c(NA, NA, 
    NA, NA, "Brrandu3", NA, NA, NA, NA, NA), F6_9_a = c(NA, NA, 
    NA, NA, "Brand3", NA, NA, NA, NA, NA), F6_10 = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA), F6_10_a = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA), F6_11 = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA), F6_11_a = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA), F6_12 = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA), F6_12_a = c(NA, NA, 
    NA, NA, "Brand6", NA, NA, NA, NA, NA)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

I think you have pasted incomplete `dput` of expected output. — Ronak Shah, Oct 09 '19 at 06:14
From the help for stringdist-metrics, I learned the soundex method cannot distinguish between numbers, so your example of `df_joined_F6_1` has many more matches than you probably intended, since Braand1 is matched with Brand1 and Brand2 and Brand3... What would you like to happen for multiple matches? — Jon Spring, Oct 09 '19 at 06:37
Yes this is true, but the method for the matching is not part of my question. In the original data the brands names do not contain any numbers. I tried all the methods provided in the package and had the best results with "soundex". — jrabensc, Oct 09 '19 at 06:39
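
The soundex point raised in the comments can be checked directly. This is a minimal sketch, assuming the stringdist package (which fuzzyjoin builds on) is installed. The Levenshtein method "lv" appears only for contrast and is not what the asker used.

```r
library(stringdist)

candidates <- c("Brand1", "Brand2", "Brand3")

# Soundex distance is 0 when two strings share a phonetic code, 1 otherwise.
# As noted in the comment above, the trailing digits do not separate the
# codes, so every candidate ties with the misspelled input.
soundex_dist <- stringdist("Braand1", candidates, method = "soundex")

# Levenshtein (edit) distance counts single-character edits, so the digit
# does make a difference here.
lv_dist <- stringdist("Braand1", candidates, method = "lv")

print(soundex_dist)
print(lv_dist)
```

With brand names that contain no digits, as in the asker's real data, this particular tie should not arise.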

score 1 · Accepted Answer · answered Oct 09 '19 at 06:48

Here's an approach using tidyr to make the data longer, then doing the join, and then making it wide again.

df1 %>%
  rowid_to_column() %>%
  pivot_longer(-rowid, names_to = "col", values_to = "brands") %>%
  stringdist_left_join(df2, method = "soundex") %>%

  # just keep first match, since real data won't have multiples
  group_by(rowid, col) %>%
  slice(1) %>%
  ungroup() %>%

  # tidying steps to make clean column titles
  rename("orig" = brands.x,
         "a" = brands.y) %>%
  gather(col2, val, c(orig, a)) %>%
  unite(col, c(col,col2))  %>%

  # make data wide again
  pivot_wider(names_from = col, values_from = val)
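
On the numbered sample data, soundex matches every entry to all six brands, so keeping the first match always returns "Brand1". If ties like that can occur, one option is to keep the closest candidate instead of the first. The following is a base R sketch of that idea, not the method used above: `closest_brand()` and its `max_dist` cutoff are hypothetical names introduced here, and it swaps soundex for Levenshtein distance via `utils::adist()`.

```r
# Hypothetical helper (not part of fuzzyjoin): for each input string, return
# the candidate with the smallest Levenshtein distance, or NA when the input
# is empty/missing or no candidate is within max_dist edits.
closest_brand <- function(x, candidates, max_dist = 2) {
  missing_x <- is.na(x) | !nzchar(x)
  x_clean <- ifelse(is.na(x), "", x)

  # Distance matrix: one row per input, one column per candidate.
  d <- utils::adist(x_clean, candidates)

  # Index of the nearest candidate per row (the first one wins on ties).
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(x_clean), best)]

  out <- candidates[best]
  out[missing_x | best_dist > max_dist] <- NA_character_
  out
}

brands <- c("Brand1", "Brand2", "Brand3", "Brand4", "Brand5", "Brand6")
matched <- closest_brand(c("Braand1", "Brando6", "Brrandu3", ""), brands)
print(matched)
```

In the pipeline above, this could stand in for the join and the `slice(1)` step, for example as `mutate(a = closest_brand(brands, df2$brands))` after `pivot_longer()`. A value such as "Brand7" is one edit away from every candidate, so it would still resolve to the first one.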

Thanks, this works perfectly. I just had to sort the columns in the end like my output table using select(). — jrabensc, Oct 09 '19 at 12:17
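
The column sorting mentioned in this comment can be built programmatically rather than typed out. A minimal sketch, assuming the wide result uses the `_orig` and `_a` suffixes produced by the answer's `unite()` step; `col_order` is a name introduced here for illustration.

```r
# Column names as they appear in df1 (F6_1 ... F6_12 in the question).
orig_cols <- paste0("F6_", 1:12)

# Interleave "<col>_orig" and "<col>_a": rbind() stacks the two name vectors
# as rows, and as.vector() reads the resulting matrix column by column.
col_order <- as.vector(rbind(paste0(orig_cols, "_orig"),
                             paste0(orig_cols, "_a")))

print(head(col_order, 4))
```

The vector can then be passed to `select()`, for example `select(rowid, all_of(col_order))`, at the end of the pipeline.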
