
I would like to do a left_join(df1, df2) based on fuzzy matches. My df1 has 100k rows and my df2 has 25k rows. Basically I would like to calculate the string similarity between the join_column of the two data frames with the Jaro-Winkler method, so that would be 100k * 25k = 2.5 billion comparisons.

I would like to have one match only, not multiple, so I am aiming to take the match with the highest string similarity. In addition, I would like to join only if the Jaro-Winkler string similarity is at least 0.8 (let's assume this condition is fulfilled in my made-up example).

I tried to show what the input looks like and what I would like to have as output, but I am struggling with the actual realization due to the size of my data frames. I already tried functions from the fuzzyjoin library but ran into memory issues. I also tried the amatch function from the stringdist package but was not able to make it work.

I could not adapt these two solutions to my problem: fuzzy LEFT join with R, fuzzy outer join/merge in R

library(tidyverse)

df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  join_column = c("alice123burgerstorechicago", "alicewonderland",
                  "bubbletea45london", "blueonion",
                  "chandle34song", "crazyjoeohio",
                  "donaldduckshop123", "dartcommunitygermany",
                  "evergreen78hall", "exittheroom15florida"))


df2 <- data.frame(
  id = c(15, 16, 18, 20),
  join_column = c("aliceburgerstorechicag", 
                  "bubbletealndon",
                  "crazyjoeohio178",
                  "exittheroom25florid"))


pre_target <- df1
target <- pre_target %>% 
  mutate(joined_with_id = c(15, NA, 16, NA, NA, 18, NA, NA, NA, 20),
         joined_with_string = c("aliceburgerstorechicag", NA, 
                                "bubbletealndon", NA, NA, 
                                "crazyjoeohio178", NA, NA, NA, 
                                "exittheroom25florid"))

2 Answers


No matter how you approach this, the number of comparisons is huge and it will always be slow. There is no such thing as a fuzzy join that avoids it: the fuzzy match has to be computed either way, while the join itself is the easy part, since you do not join on the fuzzy strings but on the match that was found.

Here is a demo on your small sample set that at least shows how the join could be done.

library(data.table)
library(stringdist)

setDT(df1)
setDT(df2)

df1_options_to_join <- df1$join_column

my_best_match_function <- function(x) {
  z <- stringdist(x, df1_options_to_join, method = "jw")
  best <- which.min(z)      # index of the closest df1 string
  if (z[best] <= 0.20) {    # equals your 0.80 similarity
    df1_options_to_join[best]
  } else {
    NA_character_
  }
}

df2[, match_to_join := my_best_match_function(join_column), by = join_column]

target <- merge(df1, df2, by.x = "join_column", by.y = "match_to_join", all = TRUE)
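Alternatively, the amatch() function from stringdist (which the question mentions) performs the same best-match-under-a-cutoff lookup in one vectorized call. A sketch, assuming maxDist = 0.2 as the cutoff (equivalent to your 0.8 similarity):

```r
library(stringdist)

# amatch() returns, for each element of x, the index of the closest entry in
# the lookup table whose distance is at most maxDist, and NA otherwise.
idx <- amatch(df2$join_column, df1$join_column, method = "jw", maxDist = 0.2)
df2$match_to_join <- df1$join_column[idx]
```

This avoids the by-group function call per row, though the full distance computation still happens under the hood.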
— Merijn van Tilborg

Complete self-promotion, but I have written an R package, zoomerjoin, which uses MinHashing to fuzzily join large datasets without comparing all pairs of rows between the two data frames. This means you can merge moderately sized (millions of rows) data frames in seconds or minutes on a modern data-science laptop without running out of memory.

MinHashing is a randomized algorithm, so there is a small chance that some pairs that should be joined are not identified, but this probability can be made arbitrarily small by adjusting the hyperparameters.
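To make that trade-off concrete: under the standard LSH banding scheme behind MinHash joins, a pair with Jaccard similarity s survives into the candidate set with probability 1 - (1 - s^r)^b for b bands of r hashes each. A toy illustration of the guarantee (not zoomerjoin's actual API; the parameter names here are illustrative):

```r
# Probability that a pair with Jaccard similarity s becomes a candidate pair
# under b bands of r minhashes each: highly similar pairs are almost surely
# kept, dissimilar ones almost surely skipped, and raising b shrinks the
# chance of missing a true match.
detect_prob <- function(s, b, r) 1 - (1 - s^r)^b

detect_prob(0.8, b = 50, r = 5)  # close to 1
detect_prob(0.2, b = 50, r = 5)  # close to 0
```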

Here's how I would use the package to join these dataframes:

library(tidyverse)
library(fuzzyjoin)

# Must have the Rust compiler installed to install this package
# see https://github.com/beniaminogreen/zoomerjoin for more details 
devtools::install_github("beniaminogreen/zoomerjoin")
library(zoomerjoin)


df1 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  join_column = c("alice123burgerstorechicago", "alicewonderland",
                  "bubbletea45london", "blueonion",
                  "chandle34song", "crazyjoeohio",
                  "donaldduckshop123", "dartcommunitygermany",
                  "evergreen78hall", "exittheroom15florida"))


df2 <- data.frame(
  id = c(15, 16, 18, 20),
  join_column = c("aliceburgerstorechicag",
                  "bubbletealndon",
                  "crazyjoeohio178",
                  "exittheroom25florid"))

## Stringdist Command
stringdist_out <- stringdist_inner_join(df1, df2,
                             by = "join_column",
                             max_dist = .4,
                             method = "jaccard")
# id.x              join_column.x id.y          join_column.y
# 1    1 alice123burgerstorechicago   15 aliceburgerstorechicag
# 2    3          bubbletea45london   16         bubbletealndon
# 3    4                  blueonion   16         bubbletealndon
# 4    6               crazyjoeohio   18        crazyjoeohio178
# 5   10       exittheroom15florida   20    exittheroom25florid



## Zoomerjoin Analogue
zoomerjoin_out <- jaccard_inner_join(df1, df2,
                      by = "join_column",
                      threshold = .6,
                      n_gram_width = 1)
# id.x              join_column.x id.y          join_column.y
#   3          bubbletea45london   16         bubbletealndon
#   1 alice123burgerstorechicago   15 aliceburgerstorechicag
#   6               crazyjoeohio   18        crazyjoeohio178
#   4                  blueonion   16         bubbletealndon
#  10       exittheroom15florida   20    exittheroom25florid
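Since the question asks for at most one match per df1 row, with a Jaro-Winkler similarity of at least 0.8, you can post-process the candidate pairs from either join by re-scoring them and keeping only the top match per row. A sketch using dplyr and stringdist:

```r
library(dplyr)
library(stringdist)

best <- zoomerjoin_out %>%
  mutate(sim = 1 - stringdist(join_column.x, join_column.y, method = "jw")) %>%
  filter(sim >= 0.8) %>%                          # enforce the similarity floor
  group_by(id.x) %>%
  slice_max(sim, n = 1, with_ties = FALSE) %>%    # single best match per df1 row
  ungroup()
```

Because the candidate set is small after the MinHash step, this re-scoring pass is cheap even for large inputs.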

The package has been tested against the fuzzyjoin package to verify that it works correctly, but it is still in development, so please don't hesitate to file an issue if you run into any bugs or problems with the documentation.