Fuzzy join with 2 large data frames

Question

Here is my example:

id <- 1:5
names_1 <- c("hannah", "marcus", "fred", "joe", "lara")
df_1 <- data.frame(id, names_1)
df_1$phonenumberFound <- NA


names_2 <- c("hannah", "markus", "fredd", "joey", "paul", "mary", "olivia")
phone <- c(123, 234, 345, 456, 567, 678, 789)
df_2 <- data.frame(names_2, phone)

What I want to achieve is:

If one of the names in df_2 (at least approximately) matches a name in df_1, then I want to add the corresponding phone number in df_1.

Basically, it's some kind of a fuzzy left join but I have not succeeded to do it.

In fact, my true df_1 has 30.000 rows and my true df_2 has 500.000 rows. Is there a fast way to do this?

Thank you!

EDIT:

I need to change and clarify my example as I'm running into memory issues using the answers provided so far. (I'm using a Windows notebook with 16 GB RAM.)

id_1 <- 1:30000
names_1 <- sample(c("hannah", "marcus", "fred", "joe", "lara"), 30000, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
df_1 <- data.frame(id_1, names_1)
df_1$numberFound <- NA

id_2 <- 1:500000
names_2 <- sample(c("hannah", "markus", "paul", "mary", "olivia"), 500000, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
anyNumber <- sample(c(123, 234, 345, 456, 567), 500000, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
df_2 <- data.frame(id_2, names_2, anyNumber)

Any helpful comments and answers are highly appreciated.

akrun · Answer 1 · 2019-12-09T14:10:38.337

3

Here is one option with fuzzyjoin

library(fuzzyjoin)
stringdist_right_join((df_2, df_1, by = c("names_2" = "names_1")) %>%
      select(names(df_1), phone)
#  id names_1 phone
#1  1  hannah   123
#2  2  marcus   234
#3  3    fred   345
#4  4     joe   456
#5  5    lara   678

Or create a matrix with stringdistmatrix from stringdist package

library(stringdist)
df_2$phone[max.col(-stringdistmatrix(df_1$names_1, df_2$names_2), 'first')]

edited Dec 09 '19 at 14:10

answered Dec 09 '19 at 13:37

akrun

874,273
37
540
662

Thanks, but I'm running into "Error: cannot allocate vector of size 68.2 Gb" and I really don't understand why. – Rami Al-Fahham Dec 09 '19 at 14:00

score 3 · Answer 2 · answered Dec 09 '19 at 14:00

3

We can use adist which computes string distance between character vectors.

adist(df_1$names_1, df_2$names_2)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,]    0    5    6    6    5    5    6
#[2,]    5    1    5    6    4    3    6
#[3,]    6    5    1    3    4    4    6
#[4,]    6    6    4    1    4    4    6
#[5,]    4    4    5    4    3    2    4

Define some suitable threshold which can be allowed and assign the corresponding phone column.

thresh <- 1
mat <- adist(df_1$names_1, df_2$names_2) <= thresh
inds <- max.col(mat) * (rowSums(mat) > 0)
df_1$phone <- df_2$phone[replace(inds, inds == 0, NA)]

df_1
#  id names_1 phone
#1  1  hannah   123
#2  2  marcus   234
#3  3    fred   345
#4  4     joe   456
#5  5    lara    NA

However, since this generates a m by n matrix it might not be the most efficient method.

answered Dec 09 '19 at 14:00

Ronak Shah

377,200
20
156
213

Also a nice one, but again I'm running into memory issues. – Rami Al-Fahham Dec 09 '19 at 14:55
yeah...If the data is too big this will give memory issues unless you have lot of memory. – Ronak Shah Dec 10 '19 at 00:28
@RamiAl-Fahham the way to come around memory issues like is to chunk up your data, Find out if you can subset on some characteristics in each dataset that will plausibly help you tell the fuzzy mathching algorithm to ignore all the most unlikely match cases - a lot of the possible matches (especially with a high distance) will be redundant. – emilBeBri Oct 01 '20 at 08:22

Fuzzy join with 2 large data frames

2 Answers2

Linked