Using the tm and slam packages, this is a less naive approach that incorporates text-processing techniques:
## load the requisite libraries
library(tm)
library(slam)
First, create a corpus from the combined towns and water vectors. We are eventually going to calculate a text-based similarity between every town and every body of water.
corpus <- Corpus(VectorSource(c(towns, water)))
Here, I do some standard preprocessing: removing punctuation and stemming the "documents". Stemming reduces words to a common root. For example, "city" and "cities" have the same stem: "citi".
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)
A standard Term Document Matrix holds raw counts of how often each word appears in each document. We also want to encode how common each word is across the entire corpus. TF-IDF weighting does this: a word like "the" that appears in nearly every document gets a weight near zero, so it contributes almost nothing to the similarity.
tdm <- weightTfIdf(TermDocumentMatrix(corpus))
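To see what the weighting does, here is a toy base-R sketch of TF-IDF on hypothetical counts. It mirrors the usual scheme (term frequencies normalised per document, a log idf); tm's exact formula may differ in details, so check ?weightTfIdf.

```r
## Toy TF-IDF sketch (hypothetical counts, base R only)
counts <- matrix(c(2, 1,   # "the"  appears in both documents
                   0, 3,   # "lake" appears only in doc2
                   1, 0),  # "citi" appears only in doc1
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("the", "lake", "citi"),
                                 c("doc1", "doc2")))
tf    <- sweep(counts, 2, colSums(counts), "/")  # term frequency per document
df    <- rowSums(counts > 0)                     # document frequency
idf   <- log2(ncol(counts) / df)                 # 0 for ubiquitous terms
tfidf <- tf * idf
tfidf["the", ]  # both entries are 0: "the" occurs in every document
```

Because "the" occurs in every document, its idf is log2(2/2) = 0 and its weight vanishes, which is exactly why ubiquitous words stop dominating the similarity.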
Lastly, we calculate the cosine similarity between every pair of documents (despite the function's name, larger values mean more similar, with 1 meaning identical term profiles). The tm package creates sparse matrices, which are usually very memory efficient, and the slam package provides matrix math functions for sparse matrices.
## cosine similarity between every pair of documents (columns):
## dot products over the outer product of the column norms
cosine_dist <- function(tdm) {
  crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
}
d <- cosine_dist(tdm)
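As a sanity check on the formula, here is the same computation with a dense base-R matrix (a hypothetical toy matrix whose columns play the role of documents): crossprod(m) gives all pairwise dot products, and the denominator is the outer product of the column norms.

```r
## Dense base-R version of the cosine similarity above (toy data)
m <- matrix(c(1, 0, 2,
              0, 3, 1,
              1, 1, 0), nrow = 3, byrow = TRUE)
cosine_dense <- function(m) {
  crossprod(m) / sqrt(colSums(m^2) %*% t(colSums(m^2)))
}
sim <- cosine_dense(m)
diag(sim)  # each document has similarity 1 with itself
```

The result is a symmetric matrix with ones on the diagonal, just like the output below.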
> d
Docs
Docs 1 2 3 4 5 6 7 8
1 1.00000000 0.034622992 0.038063800 0.044272011 0.00000000 0.0000000 0.000000000 0.260626250
2 0.03462299 1.000000000 0.055616255 0.064687275 0.01751883 0.0000000 0.146145917 0.006994714
3 0.03806380 0.055616255 1.000000000 0.071115850 0.01925984 0.0000000 0.006633427 0.007689843
4 0.04427201 0.064687275 0.071115850 1.000000000 0.54258275 0.0000000 0.007715340 0.008944058
5 0.00000000 0.017518827 0.019259836 0.542582752 1.00000000 0.0000000 0.014219656 0.016484228
6 0.00000000 0.000000000 0.000000000 0.000000000 0.00000000 1.0000000 0.121137618 0.000000000
7 0.00000000 0.146145917 0.006633427 0.007715340 0.01421966 0.1211376 1.000000000 0.005677459
8 0.26062625 0.006994714 0.007689843 0.008944058 0.01648423 0.0000000 0.005677459 1.000000000
Now we have a matrix of similarity scores between all of the towns and water bodies. We only care about one off-diagonal block of it, though: the similarities between the water bodies (rows 5 to 8) and the towns (columns 1 to 4). Hence the indexing notation in the apply function below:
best.match <- apply(d[5:8,1:4], 1, function(row) if(all(row == 0)) NA else which.max(row))
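The same pattern on a small hypothetical matrix, where rows 3:4 stand in for the water bodies and columns 1:2 for the towns:

```r
## Toy 4x4 similarity matrix: documents 1:2 are "towns", 3:4 "water"
sim <- matrix(c(1.0, 0.2, 0.7, 0.0,
                0.2, 1.0, 0.1, 0.0,
                0.7, 0.1, 1.0, 0.0,
                0.0, 0.0, 0.0, 1.0), nrow = 4, byrow = TRUE)
best <- apply(sim[3:4, 1:2], 1, function(row)
  if (all(row == 0)) NA else which.max(row))
best  # water 1 best matches town 1; water 2 matches no town (NA)
```

Each row of the sub-matrix is one water body's similarity to every town, so which.max picks the index of its best-matching town.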
And here's the output:
> cbind(water, towns[best.match])
water
[1,] "Alturas City of" "Alturas city, Modoc County"
[2,] "Casitas Municipal Water District" NA
[3,] "California Water Service Company Bellflower City" "Bellflower city, Los Angeles County"
[4,] "Contra Costa City of Public Works" "Acalanes Ridge CDP, Contra Costa County"
Notice the NA value. NA is returned when a body of water shares no stemmed terms with any of the towns, so every similarity in its row is zero and there is no best match to report.
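The zero-similarity case is easy to see with hand-built binary term vectors for two strings that share no words (hypothetical toy data):

```r
## Two documents with disjoint vocabularies: the dot product, and
## hence the cosine similarity, is exactly zero
a <- c(lake = 1, view = 1, city = 0, park = 0)
b <- c(lake = 0, view = 0, city = 1, park = 1)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))  # 0
```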