
I want to join two data frames with fuzzyjoin::regex_left_join(df1, df2, by = c("name" = "name")), where df1 has 45k rows and df2 has 2.5 million. This results in a memory error. If I split df1 up into chunks of 1000 rows, each chunk takes 15 minutes to run.
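
Roughly, the chunked attempt looks like this (the splitting helper and chunk size are illustrative, not exactly what I ran):

library(fuzzyjoin)
library(dplyr)
library(purrr)

# split df1 into blocks of 1000 rows, join each block against df2
# separately, then stack the partial results back together
chunks <- split(df1, ceiling(seq_len(nrow(df1)) / 1000))
result <- map_dfr(chunks, ~ regex_left_join(.x, df2, by = c("name" = "name")))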

It turned out to be faster (but still slow) to loop over the rows of df1, do the regex matching with the %ilike% operator from data.table, and parallelize that loop, roughly as sketched below.
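
A sketch of that loop (mclapply assumes a Unix-alike; the column handling and core count are illustrative):

library(data.table)  # for %ilike% and rbindlist
library(parallel)

# for each df1 row, %ilike% does a case-insensitive regex match of that
# row's pattern against every name in df2; matching df2 rows get the
# df1 columns attached
match_one <- function(i) {
  hits <- df2[df2$name %ilike% df1$name[i], ]
  if (nrow(hits) == 0L) return(NULL)
  cbind(pattern = df1$name[i], fname1 = df1$fname1[i], id1 = df1$id1[i], hits)
}

# fork-based parallelism; on Windows a parLapply() cluster would be needed instead
result <- rbindlist(mclapply(seq_len(nrow(df1)), match_one, mc.cores = 4))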

Is this the best I can get? Is there any way to speed up fuzzyjoin for tables this large?

EDIT:

example data:

library(tibble)

# df1$name holds regex patterns; df2$name holds the strings to match against
df1 <- tibble("name"=c("^(?i)SMITH( r.|-| [ivx]| .*)$","^(?i)BLACK( r.|-| [ivx]| .*)$","^(?i)MILLER( r.|-| [ivx]| .*)$"),
              "fname1"=c("JOHN","THOMAS","JAMES"),
              "id1"=c("aaaa","bbbb","cccc"))

df2 <- tibble("name"=c("Smith Jr.","Black III","Miller-Muller","Smith","Smith"),
              "fname2"=c("Jon","Tom","Jamie","John","Johnathan"),
              "id2"=c("1111","2222","3333","4444","5555"))
