I want to join two data frames with fuzzyjoin::regex_left_join(df1, df2, by = c(name = "name")), where df1 has 45k rows and df2 has 2.5 million. This runs out of memory. If I split df1 up into chunks of 1000 rows, each chunk takes 15 minutes to run.
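The chunked version is roughly this (a sketch; the split() call and bind_rows() are just how I'm illustrating it here):

library(fuzzyjoin)
library(dplyr)

# split df1 into 1000-row chunks and join each chunk separately to keep memory in check
chunks <- split(df1, ceiling(seq_len(nrow(df1)) / 1000))
res <- bind_rows(lapply(chunks, regex_left_join, y = df2, by = c(name = "name")))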
It turned out to be faster (though still slow) to loop over the rows of df1, do the regex matching against df2$name with data.table's %ilike% operator, and parallelize that loop.
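A stripped-down sketch of that approach (the match_one helper, the index-pair assembly, and mc.cores = 4 are just for illustration, not my exact code):

library(data.table)   # for %ilike% and rbindlist
library(parallel)     # mclapply forks, so it only parallelizes on Unix-alikes; use parLapply on Windows

# %ilike% is already case-insensitive, so the inline (?i) flag in the patterns is redundant and stripped
patterns <- sub("(?i)", "", df1$name, fixed = TRUE)

match_one <- function(i) {
  idx <- which(df2$name %ilike% patterns[i])   # rows of df2 matching the i-th pattern
  if (length(idx) == 0L) return(NULL)
  data.table(row1 = i, row2 = idx)             # keep only the matching row indices
}

pairs  <- rbindlist(mclapply(seq_len(nrow(df1)), match_one, mc.cores = 4))
result <- cbind(df1[pairs$row1, c("fname1", "id1")], df2[pairs$row2, ])   # assemble the joined rows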
Is this the best I can get? Any way of speeding up the fuzzyjoin for large tables like this?
EDIT: example data:
library(tibble)

df1 <- tibble(name   = c("^(?i)SMITH( r.|-| [ivx]| .*)$", "^(?i)BLACK( r.|-| [ivx]| .*)$", "^(?i)MILLER( r.|-| [ivx]| .*)$"),
              fname1 = c("JOHN", "THOMAS", "JAMES"),
              id1    = c("aaaa", "bbbb", "cccc"))

df2 <- tibble(name   = c("Smith Jr.", "Black III", "Miller-Muller", "Smith", "Smith"),
              fname2 = c("Jon", "Tom", "Jamie", "John", "Johnathan"),
              id2    = c("1111", "2222", "3333", "4444", "5555"))