Order Independent String Matching in R

Question

I am trying to match names in Table A to the the names present in master table. The order of names present in Table A is not exactly in a consistent format which means not necessarily name will start with first name, it's all random in some cases it could be starting with last name as well.

Illustration:

#Table A
word <- c("PILLAY NOLAN VICTOR", "PILLAY NICHOLAS")
#Master Table
choices <- c("IGOR JOSE VICTOR","WILLIAM NICHOLAS","NOLAN PILLAY","NICHOLAS PILLAY")

Executing the below code:

data <- NULL

    df <- foreach(a = idivix(length(word),chunks = no_cores),  .combine = "rbind", .packages = 'stringdist') %dopar% {
      do.call('rbind', lapply(seq(a$i,length.out = a$m), function(i)
      {
        tryCatch({
          #library(stringdist)
          d = expand.grid(word[i],choices)
          names(d) <- c("traveler_names","people_name")
          d$dist <-stringdist(d$traveler_names,d$people_name, method = "lv")
          d <- d[order(d$dist),]
          d <- d[1,]
          data<-rbind(data,d)
        }, error=function(e){})
      }))
    }

returns following match:

traveler name           people name          dist
PILLAY NOLAN VICTOR     IGOR JOSE VICTOR     10
PILLAY NICHOLAS         WILLIAM NICHOLAS      3

instead of matching with "NOLAN PILLAY" & "NICHOLAS PILLAY" because of the order dependence in string matching.

Is there any way to get the desired result in R, basically order independent string matching? Would really appreciate the help...

I would just create new columns with the words in alphabetical order. Then match on those. — Gregor Thomas, Feb 25 '19 at 16:31
Are you saying that you want to prioritize matching on last names, rather than the first name to appear in the string? — jdobres, Feb 25 '19 at 16:31

Andrew · Answer 1 · 2019-02-26T15:10:21.683

I have found that with a lot of data, the stringdist function can get bogged down. So, if you run into issues with speed, there are other options for packages (e.g., the RecordLinkage package, agrep), and other methods for matching strings (i.e., other measures of distance). Also, it is not 100% clear what you are asking, but if your issue is that you want to test to flip first and last names you could always use strsplit. For example,

> library(stringdist)
> 
> #Table A
> word <- c("PILLAY NOLAN VICTOR", "PILLAY NICHOLAS")
> #Master Table
> choices <- c("IGOR JOSE VICTOR","WILLIAM NICHOLAS","NOLAN PILLAY","NICHOLAS PILLAY")
> 
> # Try # 1
> match_dist <- sapply(word,
+        function(x) min(stringdist(x, choices, method = "lv")))
> 
> match_text <- sapply(word,
+        function(x) choices[which.min(stringdist(x, choices, method = "lv"))])
> 
> df <- data.frame("traveler name" = word,
+                  "people name" = match_text, 
+                  "dist" = match_dist, stringsAsFactors = FALSE, row.names = NULL)
> # Checking results
> df
        traveler.name      people.name dist
1 PILLAY NOLAN VICTOR IGOR JOSE VICTOR    9
2     PILLAY NICHOLAS WILLIAM NICHOLAS    3
> 
> 
> # Reversing srings, assuming names are sepearated by a space
> reversed <- sapply(strsplit(choices, " "), function(x) paste(rev(x), collapse=" ")) #reversing words
> choices <- c(choices, reversed)
> choices <- unique(choices)
> 
> 
> # Try # 2
> match_dist <- sapply(word,
+                      function(x) min(stringdist(x, choices, method = "lv")))
> 
> match_text <- sapply(word,
+                      function(x) choices[which.min(stringdist(x, choices, method = "lv"))])
> 
> df <- data.frame("traveler name" = word,
+                  "people name" = match_text, 
+                  "dist" = match_dist, stringsAsFactors = FALSE, row.names = NULL)
> 
> # Checking the new results
> df
        traveler.name     people.name dist
1 PILLAY NOLAN VICTOR    PILLAY NOLAN    7
2     PILLAY NICHOLAS PILLAY NICHOLAS    0

Depending on how your data is set up you may find it helpful (or not) to get rid of middle names, or clean the data in other ways but this should get your started.

EDIT:

I tested a couple different solutions but did not test agrep so that may be worth checking out. I would definitely favor RecordLinkage and I would even consider breaking up your data set into perfect matches and non-matches, and then only reversing (or sorting) the non-matches. The code will bottleneck calculating the measure of distance, so anything to reduce the number of names needing a measure of distance will probably help you.

> library(stringdist)
> library(RecordLinkage)
> library(microbenchmark)
> 
> #Table A
> word <- c("PILLAY NOLAN VICTOR", "PILLAY NICHOLAS", "WILLIAM NICHOLAS")
> #Master Table
> choices <- c("IGOR JOSE VICTOR","WILLIAM NICHOLAS","NOLAN PILLAY","NICHOLAS PILLAY")
> 
> microbenchmark({
+ 
+ # All reversed
+ reversed <- sapply(strsplit(choices, " "), function(x) paste(rev(x), collapse=" ")) #reversing words
+ choices1 <- c(choices, reversed)
+ choices1 <- unique(choices1)
+ 
+ match_dist <- sapply(word, function(x) min(stringdist(x, choices1, method = "lv")))
+ match_text <- sapply(word, function(x) choices1[which.min(stringdist(x, choices1, method = "lv"))])
+ 
+ df1 <- data.frame("traveler name" = word, 
+                  "people name" = match_text,
+                  "dist" = match_dist, 
+                  stringsAsFactors = FALSE, row.names = NULL)
+ }, 
+ 
+ {
+ # Record linkage
+ reversed <- sapply(strsplit(choices, " "), function(x) paste(rev(x), collapse=" ")) #reversing words
+ choices2 <- c(choices, reversed)
+ choices2 <- unique(choices2)
+   
+ match_dist2 <- sapply(word, function(x) min(levenshteinDist(x, choices2)))
+ match_text2 <- sapply(word, function(x) choices2[which.min(levenshteinDist(x, choices2))])
+   
+ df2 <- data.frame("traveler name" = word, 
+                   "people name" = match_text2,
+                   "dist" = match_dist2, 
+                   stringsAsFactors = FALSE, row.names = NULL)
+ },
+ 
+ {
+ # Sorted
+ 
+ sorted <- sapply(strsplit(choices, " "), function(x) paste(sort(x), collapse=" ")) #sorting choices
+ choices3 <- c(choices, sorted)
+ choices3 <- unique(choices3)
+ word3 <- sapply(strsplit(word, " "), function(x) paste(sort(x), collapse=" ")) #sorting words
+ 
+ match_dist3 <- sapply(word3, function(x) min(stringdist(x, choices3, method = "lv")))
+ match_text3 <- sapply(word3, function(x) choices3[which.min(stringdist(x, choices3, method = "lv"))])
+ 
+ df3 <- data.frame("traveler name" = word3, 
+                   "people name" = match_text3,
+                   "dist" = match_dist3, 
+                   stringsAsFactors = FALSE, row.names = NULL)
+ },
+ times = 1)
Unit: milliseconds


    expr          min       lq     mean   median       uq      max neval
revers     6.627258 6.627258 6.627258 6.627258 6.627258 6.627258     1
reversRL   4.016632 4.016632 4.016632 4.016632 4.016632 4.016632     1
sort       7.223453 7.223453 7.223453 7.223453 7.223453 7.223453     1
> 
> all.equal(df1, df2)
[1] TRUE
> 
> df2
        traveler.name      people.name dist
1 PILLAY NOLAN VICTOR     PILLAY NOLAN    7
2     PILLAY NICHOLAS  PILLAY NICHOLAS    0
3    WILLIAM NICHOLAS WILLIAM NICHOLAS    0
> df3
        traveler.name      people.name dist
1 NOLAN PILLAY VICTOR     NOLAN PILLAY    7
2     NICHOLAS PILLAY  NICHOLAS PILLAY    0
3    NICHOLAS WILLIAM NICHOLAS WILLIAM    0

Thanks Andrew! Data is huge which I am trying to match therefore need to do parallel computing. Secondly, data containing names is very irregular, not everywhere name is starting with last name, at some places names is still starting with first name and they are getting perfectly matched. So, its very difficult to identify from the huge list where the name is starting with last name hence could not apply strsplit and looking for a solution where order of name doesn't matter. I have tried all the methods in "stringddist" but everywhere match is incorrect because of the name order. — chitvan gupta, Feb 26 '19 at 07:53
A lot of data indeed. Unfortunately, I do not think you can get around rearranging the names in some form or fashion (either by sorting as @Gregor suggested, or by reversing strings). I'll post an edit in a sec but using `RecordLinkage` has been much faster for me (and I'm assuming more efficient). Also, jarowinkler's distance usually works well for names if you want to try that one. — Andrew, Feb 26 '19 at 15:00

Order Independent String Matching in R

1 Answers1