I would like to do a left_join(df1, df2)
based on fuzzy matches.
My df1
is 100k rows big and my df2
is 25k rows big.
Basically I would like to calculate the string similarity with jaro winkler method
between the join_colum
of the two data frames. So this would be 100k * 25k comparisons.
I would like to have one match only and not multiple. So I am aiming for taking the match with the highest string similarity. In addition to that I would like to join only if the jaro winkler string similarity is at least 0.8 (let's assume this condition is fulfilled in my made up example).
I tried to show how the input looks like and what I would like to have as an output, however I am struggeling with the actual realization due to the size of my data frames. I already tried with functions from the fuzzyjoin library but I am running into memory issues. I also tried with the amatch function from the strindist package but I was not able to make it work.
These two solutions I could not adjust to solve my problem: fuzzy LEFT join with R, fuzzy outer join/merge in R
library(tidyverse)
df1 <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
join_column = c("alice123burgerstorechicago", "alicewonderland",
"bubbletea45london", "blueonion",
"chandle34song", "crazyjoeohio",
"donaldduckshop123", "dartcommunitygermany",
"evergreen78hall", "exittheroom15florida"))
df2 <- data.frame(
id = c(15, 16, 18, 20),
join_column = c("aliceburgerstorechicag",
"bubbletealndon",
"crazyjoeohio178",
"exittheroom25florid"))
pre_target <- df1
target <- pre_target %>%
mutate(joined_with_id = c(15, NA, 16, NA, NA, 18, NA, NA, NA, 20),
joined_with_string = c("aliceburgerstorechicag", NA,
"bubbletealndon", NA, NA,
"crazyjoeohio178", NA, NA, NA,
"exittheroom25florid"))