r which rows have longest partial string match between two vectors

Question

I have two vectors that contain the names of towns, both of which are in different formats, and I need to match the names of water districts (water) to their respective census data (towns). Essentially for each row in water, I need to know the best match in towns, since most of them contain similar words such as city. One other problem I see is that words are capitalized in one data set and are not capitalized in another. Here is my example data:

towns= c("Acalanes Ridge CDP, Contra Costa County", "Bellflower city, Los Angeles County", "Arvin city, Kern County", "Alturas city, Modoc County")

water=c("Alturas City of","Casitas Municipal Water District","California Water Service Company Bellflower City", "Contra Costa City of Public Works")

If you have a list with all the cities apriori then this becomes very easy to solve. I think it's better to try and avoid partial match as much as possible — Sotos, Apr 28 '16 at 17:36
Unfortunately I do not have that. If I was going to make that list it would probably be easier to just go through the 400 water districts and match them to any of the 1500 towns by hand. — Chris Heckman, Apr 28 '16 at 19:02

score 2 · Accepted Answer · answered Apr 28 '16 at 18:10

Using the tm and slam packages, this is a less naive approach that incorporates text-processing techniques:

## load the requisite libraries
library(tm)
library(slam)

First, create a corpus from the combined towns and water vectors. We are eventually going to calculate the distance between every town and every body of water based on the text.

corpus <- Corpus(VectorSource((c(towns, water))))

Here, I do some standard preprocessing by removing punctuation and stemming the "documents". Stemming finds the common underlying parts of words. For example, city and cities have the same stem: citi

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

A standard Term Document Matrix has binary indicators for which words are in which documents. We want to encode additional information about how frequent the word is in the entire corpus as well. For example, we don't care how often "the" appears in a document because it is incredibly common.

tdm <- weightTfIdf(TermDocumentMatrix(corpus))

Lastly, we calculate the cosine distance between every document. The tm package creates sparse matrices which are usually very memory efficient. The slam package has matrix math functions for sparse matrices.

cosine_dist <- function(tdm) {
  crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
}

d <- cosine_dist(tdm)
> d
    Docs
Docs          1           2           3           4          5         6           7           8
   1 1.00000000 0.034622992 0.038063800 0.044272011 0.00000000 0.0000000 0.000000000 0.260626250
   2 0.03462299 1.000000000 0.055616255 0.064687275 0.01751883 0.0000000 0.146145917 0.006994714
   3 0.03806380 0.055616255 1.000000000 0.071115850 0.01925984 0.0000000 0.006633427 0.007689843
   4 0.04427201 0.064687275 0.071115850 1.000000000 0.54258275 0.0000000 0.007715340 0.008944058
   5 0.00000000 0.017518827 0.019259836 0.542582752 1.00000000 0.0000000 0.014219656 0.016484228
   6 0.00000000 0.000000000 0.000000000 0.000000000 0.00000000 1.0000000 0.121137618 0.000000000
   7 0.00000000 0.146145917 0.006633427 0.007715340 0.01421966 0.1211376 1.000000000 0.005677459
   8 0.26062625 0.006994714 0.007689843 0.008944058 0.01648423 0.0000000 0.005677459 1.000000000

Now we have a matrix of similarity scores between all of the towns and water bodies in the same matrix. We only care about the distances for half of this matrix, though. Hence the indexing notation in the apply function below:

best.match <- apply(d[5:8,1:4], 1, function(row) if(all(row == 0)) NA else which.max(row))

And here's the output:

> cbind(water, towns[best.match])
     water                                                                                       
[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
[2,] "Casitas Municipal Water District"                 NA                                       
[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"

Notice the NA value. NA is returned when there isn't a single word match between a body of water and all of the towns.

@Zelanzy7 Thank you for your response. I was not able to get past creating tdm. Maybe I should mention that my water data is only 400 rows compared to the 1500 in towns. The error is gave me was: Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : 'i, j, v' different lengths In addition: Warning messages: 1: In mclapply(unname(content(x)), termFreq, control) : all scheduled cores encountered errors in user code 2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), : NAs introduced by coercion — Chris Heckman, Apr 28 '16 at 18:49
I'm not sure what is going on with your example. If you run my code on your dummy data, it works just fine. Are you using vectors? Or is the data stored some other way like in data.frames? — Zelazny7, Apr 28 '16 at 19:27
@Zelanzy7 I have tried to come back to this and still cant get it to work. I know the example I gave you works, but can figure it out with my real data. The data I have is stored in data.frames but I have tried doing Corpus(VectorSource((c(example$example, example2$example2)))). I've also tried make to do tmp1= as.vector(example$example) and then added tmp1 and tmp2 into the corpus command. I've tried to do the same thing with as.list. I am at a lost for why this may possibly not be working. And again, the only error I get is when trying to make tdm, and that fails. I could post my data — Chris Heckman, May 18 '16 at 19:01
You could even post the output of `dput(head(example))` and that might be enough for troubleshooting. If you can post the whole `data.frame` somewhere, though, that would be ideal. — Zelazny7, May 18 '16 at 19:31
@Zelanzy7 Here are the two data sets as well as the code: https://drive.google.com/folderview?id=0B3mc92zjVzbMdU9zY0pRSWFJWHc&usp=sharing — Chris Heckman, May 18 '16 at 20:01
I'm not sure what's going on with your setup. I was able to execute your code without issue. I have uploaded the dataset with the matched city names for you: https://drive.google.com/open?id=0B-tqBaUiJLbPV3lFUHVxRGVOYm8 — Zelazny7, May 18 '16 at 20:34
@Zelanzy7 So I fixed the problem of running the object "tdm" by converting the text to utf-8 and removing the lines with special characters. I still cant get the same match you did though. I am guessing that you changed the indexing notation in best.match, but I dont understand what is going on well enough to know how to change it. I have to use this to merge a couple different data sets so a quick explanation on how to do that would be greatly appreciated. Right now I just shows all NA in the second column with no matches. — Chris Heckman, May 18 '16 at 23:40
Try this: `n <- length(tmp1); best.match <- apply(d[1:n,(n+1):ncol(d)], 1, function(row) if(all(row == 0)) NA else which.max(row)); out <- as.data.frame(cbind(tmp1, tmp2[best.match])) ` — Zelazny7, May 19 '16 at 18:18

Sotos · Answer 2 · 2016-04-29T08:53:51.343

Another possible way to do it using just base R. We split the strings from water using strsplit thus creating a list, and we check to see which of those strings are found in towns using grepl. We now have a list of 4 logical matrices. By applying rowSums, we get the sum of 'TRUE' for each row. We use which.max to identify the row with most 'TRUE' values. Finally, we use those values for indexing towns.

lst <- lapply(strsplit(water, ' '), function(i)
                       sapply(tolower(i), function(j)
                                 grepl(j, tolower(towns))))

ind <- unlist(as.numeric(lapply(lst, function(i)
                   which.max(rowSums(i)[!is.na(match(TRUE, i))]))))

cbind(water, towns[ind])
#            water                                                                                       
#[1,] "Alturas City of"                                  "Alturas city, Modoc County"             
#[2,] "Casitas Municipal Water District"                 NA                                       
#[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"    
#[4,] "Contra Costa City of Public Works"                "Acalanes Ridge CDP, Contra Costa County"

Side Note: I used [!is.na(match(TRUE, i))] to only calculate the rowSums when there are indeed 'TRUE' values in the matrix. Otherwise the rowSums of a 4 x 4 logical matrix with all 'FALSE' is 0, 0, 0, 0, and taking which.max(c(0, 0, 0, 0)) gives 1, which is quite interesting.

r which rows have longest partial string match between two vectors

2 Answers2

Linked