
Match a large number of slightly varying restaurant names in the "data" vector to the appropriate entries of the "match" vector:

The stringdistmatrix function in the stringdist package is great, but it runs out of memory at a few 10k x 10k, and my data is larger.

I tried as(stringdistmatrix(data, match), 'sparseMatrix'), which would give the hoped-for result, but it also runs out of memory. Hence, I would like to explicitly index the pairs using sparseMatrix(i, j, x, dims, dimnames), with x calculated by adist() or a similar string distance, in the hope that it would fit in memory.


data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy", 
          "Chipotle", "Chipotle's")

match <- c("McDonalds", "Wendys", "Chipotle")

Trying:

library(Matrix)
library(stringdist)

# The original match(idx$a, idx$b) / match(idx$b, idx$a) calls produced NAs;
# each name has to be indexed against its own source vector instead.
idx <- expand.grid(a = data, b = match, stringsAsFactors = FALSE)
idx$row <- match(idx$a, data)   # row = position of the dirty name in data
idx$col <- match(idx$b, match)  # col = position of the clean name in match

# expand.grid varies `a` fastest, matching the column-major order of adist();
# data has 8 elements, so dims must be c(8, 3)
sparseMatrix(i = idx$row,
             j = idx$col,
             x = ifelse(as.vector(adist(data, match)) < 2, 1, 0),
             dims = c(8, 3),
             dimnames = list(data, match))

Hoped-for output to match:

library(Matrix)
library(stringdist)
as(ifelse(stringdistmatrix(data, match) < 2, 1, 0), 'sparseMatrix')
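
A memory-friendlier route, as a minimal sketch (the chunking scheme and chunk_size below are illustrative, not from the question itself): compute the distances in blocks of rows and keep only the pairs under the threshold, so the full dense matrix never exists at once.

library(Matrix)
library(stringdist)

chunk_size <- 2  # tiny for this toy data; use thousands of rows in practice
chunks <- split(seq_along(data), ceiling(seq_along(data) / chunk_size))

# For each block of dirty names, compute a small dense distance block and
# keep only the (row, column) pairs with distance < 2
triplets <- do.call(rbind, lapply(chunks, function(rows) {
  d <- stringdistmatrix(data[rows], match)
  hit <- which(d < 2, arr.ind = TRUE)
  cbind(i = rows[hit[, "row"]], j = hit[, "col"])
}))

sparseMatrix(i = triplets[, "i"], j = triplets[, "j"], x = 1,
             dims = c(length(data), length(match)),
             dimnames = list(data, match))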
  • Your distance matrix isn't sparse (i.e. it's not mostly zeroes). You need a different approach to store a huge non-sparse matrix. – user2554330 Jun 23 '19 at 20:58
  • Are all the `NA`s in `idx$row` intended? – jay.sf Jun 23 '19 at 21:00
  • Understood re: the distance matrix; it seems like I could invert it with ifelse(adist(data, match) > 2, 0, 1). I think I'm mainly struggling to understand how to index. – David Lucey Jun 23 '19 at 21:01
  • NA's are not intended. – David Lucey Jun 23 '19 at 21:03
  • @DavidLucey ok, fix your code please – jay.sf Jun 23 '19 at 21:04
  • idx$col <- match(idx$b, idx$a) was a best guess; I really don't know how to index this. The main objective was to try to get it to match library(stringdist); as(stringdistmatrix(data, match), 'sparseMatrix') – David Lucey Jun 23 '19 at 21:15
  • Maybe this question and my answer to it can help with this specific problem. https://stackoverflow.com/questions/56680860/is-there-an-efficient-strategy-for-doing-a-fuzzy-join-on-customer-data-to-identi/56694676#56694676 – TimTeaFan Jun 23 '19 at 21:30
  • Which version of stringdist are you using? There was an issue with long strings for a long time, which has supposedly been fixed now: https://github.com/markvanderloo/stringdist/issues/59 – JBGruber Jun 23 '19 at 21:49
  • Thank you @TimTeaFan. I would block as per TimTeaFan, but McDonalds is in many zips, so I can't think of a natural subset for this one. It seems like others may have this challenge in the future. – David Lucey Jun 23 '19 at 23:24
  • Thank you @JBGruber. I have 700k restaurants nationwide, so I think I'm above the threshold for 9.5.2. ING built an answer in Python using C++: https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618. I thought maybe I could help find one for R. – David Lucey Jun 23 '19 at 23:25
  • Sorry if I am slow to follow. Are you just looking to pick the best match / closest in spelling, or is your approach dependent on using a matrix? – Andrew Jun 24 '19 at 01:38
  • I understand that your specific question is about getting a sparse matrix to work with stringdist. The larger problem you are dealing with is how to efficiently build a stringdist matrix. I doubt that running out of memory is your only problem: even if you had enough memory, calculating a 700k+ stringdist matrix will take forever unless you use some super machine/server. My subset approach should work. You could break your data set down by starting letter of the restaurant, or you could break your data into groups of a specific character length (for example 5-7 characters), etc. – TimTeaFan Jun 24 '19 at 13:35
  • In summary, I was able to do this in RStudio, but using Python via reticulate and a combination of TF-IDF and n-grams with cosine similarity, calculated across all 750k names in about 45 minutes. Credit goes to Chris van den Berg and ING Wholesale Banking Advanced Analytics, who created the sparse_dot_topn package, which is really fantastic. Here is the write-up by Chris van den Berg: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html and by ING: https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618. – David Lucey Jun 27 '19 at 17:21
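
For later readers, the TF-IDF / n-gram / cosine-similarity approach described in the comment above can be sketched in pure R with the Matrix package. This is a minimal illustration assuming lower-cased character trigrams; the ngrams() helper and all parameter choices are made up for the example, not taken from the linked write-ups.

library(Matrix)

# Hypothetical helper: lower-cased character trigrams of a single string
ngrams <- function(s, n = 3) {
  k <- max(nchar(s) - n + 1, 1)
  substring(tolower(s), 1:k, 1:k + n - 1)
}

names_all <- c(data, match)
grams <- lapply(names_all, ngrams)
vocab <- unique(unlist(grams))

# Sparse document-term matrix of trigram counts; duplicated (i, j) pairs are
# summed by sparseMatrix(). base::match() is spelled out because a vector
# named `match` is in scope.
dtm <- sparseMatrix(i = rep(seq_along(grams), lengths(grams)),
                    j = base::match(unlist(grams), vocab),
                    x = 1,
                    dims = c(length(names_all), length(vocab)))

# TF-IDF weighting, then L2-normalise rows so a cross-product yields cosines
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- dtm %*% Diagonal(x = idf)
tfidf <- Diagonal(x = 1 / sqrt(rowSums(tfidf^2))) %*% tfidf

# Cosine similarity of every dirty name against every clean name
sim  <- tfidf[seq_along(data), ] %*% t(tfidf[length(data) + seq_along(match), ])
best <- apply(as.matrix(sim), 1, which.max)
data.frame(data = data, matched = match[best], stringsAsFactors = FALSE)

The sparse_dot_topn package mentioned above additionally prunes the product to the top-n most similar pairs per row during the multiplication itself, which is what keeps the 750k-name case tractable.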

1 Answer


If I understand your question correctly, your task is to match dirty strings to clean strings. You do not need the whole matrix for that (and it would indeed not be sparse). Instead, you can use amatch().

library(stringdist)
data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy", 
          "Chipotle", "Chipotle's")
match <- c("McDonalds", "Wendys", "Chipotle")
# For each dirty name, the index of the closest clean name (OSA distance, max 2)
i <- amatch(data, match, method = "osa", maxDist = 2)
data.frame(data=data, matched_data = match[i], stringsAsFactors = FALSE)

         data matched_data
1   McDonalds    McDonalds
2  MacDonalds    McDonalds
3 Mc Donald's    McDonalds
4     Wendy's       Wendys
5      Wendys       Wendys
6       Wendy       Wendys
7    Chipotle     Chipotle
8  Chipotle's     Chipotle
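
One detail worth knowing from the stringdist documentation: amatch() returns NA when no element of the lookup table is within maxDist, so unmatched names are easy to flag. A quick illustration with a made-up name not in match:

# "Burger King" has no counterpart within maxDist = 2, so its index is NA
amatch(c("Burger King", "Wendys"), match, method = "osa", maxDist = 2)
# [1] NA  2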
  • It works perfectly and from none other than Mark van der Loo. Thanks for this answer, your online book, package, etc! – David Lucey Jul 27 '19 at 14:45