
I have some code that I would like to vectorize but I am not sure how. The following code gives some example data, consisting of names and addresses.

name <- c("holiday inn", "geico", "zgf", "morton phillips")
address <- c("400 lafayette pl tupelo ms", "227 geico plaza chevy chase md", 
         "811 quincy st washington dc", "1911 1st st rockville md")

source1 <- data.frame(name, address)

name <- c("williams sonoma", "mamas bbq", "davis polk", "hop a long diner",
      "joes crag shack", "mike lowry place", "holiday inn", "zummer")

name2 <- c(NA, NA, NA, NA, NA, NA, "hi express", "zummer gunsul frasca")
address <- c("2 reads way new castle de", "248 w 4th st newark de",
         "1100 21st st nw washington dc", "1804 w 5th st wilmington de",
         "1208 kenwood parkway holdridge nb", "4203 ocean drive miami fl",
         "400 lafayette pl tupelo ms", "811 quincy st washington dc")
source2 <- data.frame(name, name2, address) 

This block calculates the Levenshtein distance between two columns of text via R's native adist function and then applies the min function.

dist.name <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.address <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

min.name <- apply(dist.name, 2, min)
min.address <- apply(dist.address, 2, min)

I would like to do the following:

  1. Match source1$name with source2$name based on the minimum Levenshtein distance.
  2. If the result of 1 yields an NA, match based on source1$address and source2$address using Levenshtein distance. I have tried using a for loop, which works fine for 1 but not for 2. Here is the code I used to try to incorporate both:

    match.s1.s2 <- NULL
    for (i in 1:nrow(dist.name)) {
      for (j in 1:nrow(dist.address)) {
        if (is.na(match(min.name[i], dist.name[i, ]))) {
          s2.i <- match(min.address[j], dist.address[j, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                          s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name,
                                          adist = min.name[j],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address),
                               match.s1.s2)
        } else {
          s2.i <- match(min.name[i], dist.name[i, ])
          s1.i <- i
          match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                          s2name = source2[s2.i, ]$name,
                                          s1name = source1[s1.i, ]$name,
                                          adist = min.name[i],
                                          s1.i.address = source1[s1.i, ]$address,
                                          s2.i.address = source2[s2.i, ]$address),
                               match.s1.s2)
        }
      }
    }
    

My problem is that this is slow and it produces a data frame that is much too large. The end result, the data frame match.s1.s2, should have the same number of rows as source1. Any advice or help would be much appreciated. Thanks.

jvalenti

1 Answer


It would be more efficient to use normalized scores (between 0 and 1). That way you could use a vectorized ifelse to replace each NA with the corresponding address score. With non-normalized scores you have to replace the entire row. Try this approach:

dist.mat.nm <- adist(source1$name, source2$name, partial = TRUE, ignore.case = TRUE)
dist.mat.ad <- adist(source1$address, source2$address, partial = TRUE, ignore.case = TRUE)

# If you use non-normalized distances, you have to replace the whole row
dist.mat <- dist.mat.nm
for (i in 1:nrow(dist.mat)) {
  # any NA in the name row means we fall back to the address distances for that row
  if (any(is.na(dist.mat[i, ]))) dist.mat[i, ] <- dist.mat.ad[i, ]
}

# If you use normalized distances, NAs can be replaced cell by cell
dist.mat <- ifelse(is.na(dist.mat.nm), dist.mat.ad, dist.mat.nm)

which.match <- function(x, nm) return(nm[which(x == min(x))[1]])

matches <- apply(dist.mat, 1, which.match, nm = source2$name)

That should improve performance and solve your problem. If you're willing to switch to a normalized distance (instead of Levenshtein), I would recommend Jaro-Winkler.
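
For reference, here is a minimal sketch of that Jaro-Winkler route using the stringdist package (an assumption on my part; the answer above does not name a package). The "jw" method already returns distances scaled to [0, 1], so the NA fallback remains a simple ifelse:

library(stringdist)

# Jaro-Winkler distances, already normalized to [0, 1]; tolower() stands in for ignore.case = TRUE
jw.nm <- stringdistmatrix(tolower(source1$name), tolower(source2$name), method = "jw", p = 0.1)
jw.ad <- stringdistmatrix(tolower(source1$address), tolower(source2$address), method = "jw", p = 0.1)

# fall back to the address distance wherever the name distance is NA
jw.mat <- ifelse(is.na(jw.nm), jw.ad, jw.nm)

matches <- apply(jw.mat, 1, which.min)
source1$matched_name <- source2$name[matches]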

  • Thanks, this is pretty intuitive. However, it doesn't account for addresses as well, which is where the problem comes in. – jvalenti Jan 08 '18 at 21:10
  • Edited it to consider addresses. – Anderson Neisse Jan 09 '18 at 11:52
  • Wow. This works quite well. Very impressive. I have some other questions if you don't mind. Is there a way to know which match was used (i.e. address or name) for matches, say via an indicator column? Also, why can't non-normalized distances be vectorized? – jvalenti Jan 09 '18 at 15:50
  • 1. Yes, there is: the `which(x == min(x))` in the code returns the matches' positions. I use it with a `[1]` appended so it returns only the first match if there's a tie. Then I wrap it in `nm[]` to return the values instead of the positions. If you want both, in order to improve performance you can use `function(x, nm) return(nm[which(x == min(x))[1]])` instead, which returns the positions, and then add a "matched_name" column with `source1$matched_name <- source2$name[matches]`. That way your function performs even better (a sketch putting this together appears after these comments). – Anderson Neisse Jan 09 '18 at 16:13
  • 2. Non-normalized distances vary with the lengths of the strings. Because of that, you can't take a single value from the address matrix and compare it against values in the name matrix. If you normalize them, the name and address matrices vary on the same scale, allowing you to compare any mix of a row's values in the `min` call. – Anderson Neisse Jan 09 '18 at 16:20
  • I don't understand what you mean by `function(x, nm) return(nm[which(x == min(x))[1]])`... isn't that already used in the `which.match` function? I was hoping to get the positions in each data frame that were matched, i.e. the indexes, so I could go back and inspect them myself if need be, along with the matched addresses and names. Sorry if I wasn't clear. – jvalenti Jan 09 '18 at 18:13
  • Yes, sorry. The correct code is `function(x, nm) return(which(x == min(x))[1])`. If you call `apply` with this function, assigning the result to `matches`, you will have the indexes. With the indexes you can run `nm.matches <- source2$name[matches]`. – Anderson Neisse Jan 09 '18 at 18:16
  • Why do you think Jaro-Winkler is better than Levenshtein distance? Is it based on error rates? – jvalenti Jan 09 '18 at 21:33
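
For completeness, a small follow-up sketch (mine, not from the thread) that puts the pieces from these comments together: it keeps the matched indexes, the matched names and addresses, and an indicator column saying whether the name or the address distance supplied the match. It assumes dist.mat.nm and dist.mat from the answer above; the column names are illustrative, and the result has one row per row of source1:

# index of the best match in source2 for each row of source1
match.idx <- apply(dist.mat, 1, function(x) which(x == min(x, na.rm = TRUE))[1])

# TRUE where the chosen cell was NA in the name matrix, i.e. the address distance decided the match
from.address <- is.na(dist.mat.nm[cbind(seq_along(match.idx), match.idx)])

match.s1.s2 <- data.frame(
  s1.i       = seq_len(nrow(source1)),
  s2.i       = match.idx,
  s1.name    = source1$name,
  s2.name    = source2$name[match.idx],
  s1.address = source1$address,
  s2.address = source2$address[match.idx],
  matched.on = ifelse(from.address, "address", "name")
)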