
Match a large number of slightly varying restaurant names in the "data" vector to the appropriate entries of the "match" vector:

The stringdistmatrix function in the stringdist package is great, but it runs out of memory at a few 10k x 10k, and my data is larger.

I tried as(stringdistmatrix(data, match), 'sparseMatrix'), which would give the hoped-for result, but it also runs out of memory. Hence, I would like to explicitly index the pairs using sparseMatrix(i, j, x, dims, dimnames), with x calculated by adist() or a similar string distance, in the hope that it would fit in memory.


data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy", 
          "Chipotle", "Chipotle's")

match <- c("McDonalds", "Wendys", "Chipotle")

Trying:

library(Matrix)
library(stringdist)

# The original match(idx$a, idx$b) / match(idx$b, idx$a) calls produced NAs;
# each name has to be indexed against its own source vector instead.
idx <- expand.grid(a = data, b = match, stringsAsFactors = FALSE)
idx$row <- match(idx$a, data)   # row = position of the dirty name in data
idx$col <- match(idx$b, match)  # col = position of the clean name in match

# expand.grid varies `a` fastest, matching the column-major order of adist();
# data has 8 elements, so dims must be c(8, 3)
sparseMatrix(i = idx$row,
             j = idx$col,
             x = ifelse(as.vector(adist(data, match)) < 2, 1, 0),
             dims = c(8, 3),
             dimnames = list(data, match))

Hoped-for output to match:

library(Matrix)
library(stringdist)
as(ifelse(stringdistmatrix(data, match) < 2, 1, 0), 'sparseMatrix')
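
A memory-friendlier route, as a minimal sketch (the chunking scheme and chunk_size below are illustrative, not from the question itself): compute the distances in blocks of rows and keep only the pairs under the threshold, so the full dense matrix never exists at once.

library(Matrix)
library(stringdist)

chunk_size <- 2  # tiny for this toy data; use thousands of rows in practice
chunks <- split(seq_along(data), ceiling(seq_along(data) / chunk_size))

# For each block of dirty names, compute a small dense distance block and
# keep only the (row, column) pairs with distance < 2
triplets <- do.call(rbind, lapply(chunks, function(rows) {
  d <- stringdistmatrix(data[rows], match)
  hit <- which(d < 2, arr.ind = TRUE)
  cbind(i = rows[hit[, "row"]], j = hit[, "col"])
}))

sparseMatrix(i = triplets[, "i"], j = triplets[, "j"], x = 1,
             dims = c(length(data), length(match)),
             dimnames = list(data, match))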
  • Your distance matrix isn't sparse (i.e. it's not mostly zeroes). You need a different approach to store a huge non-sparse matrix. – user2554330 Jun 23 '19 at 20:58
  • Are all the `NA`s in `idx$row` intended? – jay.sf Jun 23 '19 at 21:00
  • Understood re: the distance matrix; it seems like I could invert it with ifelse(adist(data, match) > 2, 0, 1). I think I'm mainly struggling to understand how to index. – David Lucey Jun 23 '19 at 21:01
  • NA's are not intended. – David Lucey Jun 23 '19 at 21:03
  • @DavidLucey ok, fix your code please – jay.sf Jun 23 '19 at 21:04
  • idx$col <- match(idx$b, idx$a) was a best guess; I really don't know how to index this. The main objective was to try to get it to match library(stringdist); as(stringdistmatrix(data, match), 'sparseMatrix') – David Lucey Jun 23 '19 at 21:15
  • Maybe this question and my answer to it can help with this specific problem. https://stackoverflow.com/questions/56680860/is-there-an-efficient-strategy-for-doing-a-fuzzy-join-on-customer-data-to-identi/56694676#56694676 – TimTeaFan Jun 23 '19 at 21:30
  • Which version of stringdist are you using? There was an issue with long strings for a long time, which has supposedly been fixed now: https://github.com/markvanderloo/stringdist/issues/59 – JBGruber Jun 23 '19 at 21:49
  • Thank you @TimTeaFan. I would block as per TimTeaFan, but McDonalds is in many zips, so I can't think of a natural subset for this one. It seems like others may have this challenge in the future. – David Lucey Jun 23 '19 at 23:24
  • Thank you @JBGruber. I have 700k restaurants nationwide, so I think I'm above the threshold for 9.5.2. ING built an answer in Python using C++: https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618. I thought maybe I could help find one for R. – David Lucey Jun 23 '19 at 23:25
  • Sorry if I am slow to follow. Are you just looking to pick the best match / closest in spelling, or is your approach dependent on using a matrix? – Andrew Jun 24 '19 at 01:38
  • I understand that your specific question is about getting a sparse matrix to work with stringdist. The larger problem you are dealing with is how to efficiently build a stringdist matrix. I doubt that running out of memory is your only problem: even if you had enough memory, calculating a 700k+ stringdist matrix will take forever unless you use some super machine/server. My subset approach should work. You could break your data set down by starting letter of the restaurant, or you could break your data into groups of a specific character length (for example 5-7 characters), etc. – TimTeaFan Jun 24 '19 at 13:35
  • In summary, I was able to do this in RStudio, but using Python via reticulate and a combination of TF-IDF and n-grams with cosine similarity, calculated across all 750k names in about 45 minutes. Credit goes to Chris van den Berg and ING Wholesale Banking Advanced Analytics, who created the sparse_dot_topn package, which is really fantastic. Here is the write-up by Chris van den Berg: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html and by ING: https://medium.com/wbaa/https-medium-com-ingwbaa-boosting-selection-of-the-most-similar-entities-in-large-scale-datasets-450b3242e618. – David Lucey Jun 27 '19 at 17:21
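
For later readers, the TF-IDF / n-gram / cosine-similarity approach described in the comment above can be sketched in pure R with the Matrix package. This is a minimal illustration assuming lower-cased character trigrams; the ngrams() helper and all parameter choices are made up for the example, not taken from the linked write-ups.

library(Matrix)

# Hypothetical helper: lower-cased character trigrams of a single string
ngrams <- function(s, n = 3) {
  k <- max(nchar(s) - n + 1, 1)
  substring(tolower(s), 1:k, 1:k + n - 1)
}

names_all <- c(data, match)
grams <- lapply(names_all, ngrams)
vocab <- unique(unlist(grams))

# Sparse document-term matrix of trigram counts; duplicated (i, j) pairs are
# summed by sparseMatrix(). base::match() is spelled out because a vector
# named `match` is in scope.
dtm <- sparseMatrix(i = rep(seq_along(grams), lengths(grams)),
                    j = base::match(unlist(grams), vocab),
                    x = 1,
                    dims = c(length(names_all), length(vocab)))

# TF-IDF weighting, then L2-normalise rows so a cross-product yields cosines
idf   <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- dtm %*% Diagonal(x = idf)
tfidf <- Diagonal(x = 1 / sqrt(rowSums(tfidf^2))) %*% tfidf

# Cosine similarity of every dirty name against every clean name
sim  <- tfidf[seq_along(data), ] %*% t(tfidf[length(data) + seq_along(match), ])
best <- apply(as.matrix(sim), 1, which.max)
data.frame(data = data, matched = match[best], stringsAsFactors = FALSE)

The sparse_dot_topn package mentioned above additionally prunes the product to the top-n most similar pairs per row during the multiplication itself, which is what keeps the 750k-name case tractable.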

1 Answer


If I understand your question correctly, your task is to match dirty strings to clean strings. You do not need the whole matrix for that (and it would indeed not be sparse). Instead, you can use amatch().

library(stringdist)
data <- c("McDonalds", "MacDonalds", "Mc Donald's", "Wendy's", "Wendys", "Wendy", 
          "Chipotle", "Chipotle's")
match <- c("McDonalds", "Wendys", "Chipotle")
# For each dirty name, the index of the closest clean name (OSA distance, max 2)
i <- amatch(data, match, method = "osa", maxDist = 2)
data.frame(data=data, matched_data = match[i], stringsAsFactors = FALSE)

         data matched_data
1   McDonalds    McDonalds
2  MacDonalds    McDonalds
3 Mc Donald's    McDonalds
4     Wendy's       Wendys
5      Wendys       Wendys
6       Wendy       Wendys
7    Chipotle     Chipotle
8  Chipotle's     Chipotle
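
One detail worth knowing from the stringdist documentation: amatch() returns NA when no element of the lookup table is within maxDist, so unmatched names are easy to flag. A quick illustration with a made-up name not in match:

# "Burger King" has no counterpart within maxDist = 2, so its index is NA
amatch(c("Burger King", "Wendys"), match, method = "osa", maxDist = 2)
# [1] NA  2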
  • It works perfectly and from none other than Mark van der Loo. Thanks for this answer, your online book, package, etc! – David Lucey Jul 27 '19 at 14:45