I want to join two data frames with fuzzyjoin::regex_left_join(df1, df2, by = c(name = "name")), where df1 has 45k rows and df2 has 2.5 million. This runs out of memory. If I split df1 up into chunks of 1000 rows, each chunk takes 15 minutes to run.
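The chunked version is roughly this (a sketch; the split() call and bind_rows() are just how I'm illustrating it here):

library(fuzzyjoin)
library(dplyr)

# split df1 into 1000-row chunks and join each chunk separately to keep memory in check
chunks <- split(df1, ceiling(seq_len(nrow(df1)) / 1000))
res <- bind_rows(lapply(chunks, regex_left_join, y = df2, by = c(name = "name")))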
It turned out to be faster (though still slow) to loop over the rows of df1, do the regex matching against df2$name with data.table's %ilike% operator, and parallelize that loop.
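A stripped-down sketch of that approach (the match_one helper, the index-pair assembly, and mc.cores = 4 are just for illustration, not my exact code):

library(data.table)   # for %ilike% and rbindlist
library(parallel)     # mclapply forks, so it only parallelizes on Unix-alikes; use parLapply on Windows

# %ilike% is already case-insensitive, so the inline (?i) flag in the patterns is redundant and stripped
patterns <- sub("(?i)", "", df1$name, fixed = TRUE)

match_one <- function(i) {
  idx <- which(df2$name %ilike% patterns[i])   # rows of df2 matching the i-th pattern
  if (length(idx) == 0L) return(NULL)
  data.table(row1 = i, row2 = idx)             # keep only the matching row indices
}

pairs  <- rbindlist(mclapply(seq_len(nrow(df1)), match_one, mc.cores = 4))
result <- cbind(df1[pairs$row1, c("fname1", "id1")], df2[pairs$row2, ])   # assemble the joined rows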
Is this the best I can get? Any way of speeding up the fuzzyjoin for large tables like this?
EDIT: example data:
library(tibble)

df1 <- tibble(name   = c("^(?i)SMITH( r.|-| [ivx]| .*)$", "^(?i)BLACK( r.|-| [ivx]| .*)$", "^(?i)MILLER( r.|-| [ivx]| .*)$"),
              fname1 = c("JOHN", "THOMAS", "JAMES"),
              id1    = c("aaaa", "bbbb", "cccc"))

df2 <- tibble(name   = c("Smith Jr.", "Black III", "Miller-Muller", "Smith", "Smith"),
              fname2 = c("Jon", "Tom", "Jamie", "John", "Johnathan"),
              id2    = c("1111", "2222", "3333", "4444", "5555"))