I have a survey dataset in which respondents described the location of their activity, usually as a town or city name. I want to identify each unique mention of the named cities and count the number of times each city was mentioned. The final output should be a vector of counts, one per city. One challenge is that city names may be misspelled, have odd capitalization, or be embedded within a longer string (which may also include more than one city). I have a master list of city names with proper capitalization and spelling, which I have been trying to use as the pattern for the agrep function.
A sample chunk of the dataset is structured as follows:
survey <- c("Salem", "salem, ma","Manchester","Manchester-By-The-Sea")
master <- c("Beverly","Gloucester","Manchester-by-the-Sea","Nahant","Salem")
In this sample, the final result would be a vector:
result
[1] 0 0 2 0 2
I have been trying to construct a function that uses agrep to loop through the master vector: for each city it should search the survey vector for matches, count them, and output the number of matches for each item of the master vector. Here is what I have so far, but all I get is NULL. I am not sure what I am doing wrong, or whether there is a better way to approach this problem.
idx <- NULL
matches <- NULL
n.match <- function(pattern, x, ...) {
for (i in 1:length(pattern))
idx <- vector()
idx <- agrep(pattern[i],x,ignore.case=TRUE, value=FALSE, max.distance = 2)
matches[i] <- length(idx)
}
n.match(master,survey)
matches
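For comparison, here is a minimal sketch of how the loop might be restructured, not a definitive solution. It assumes the likely causes of the NULL are that the for loop has no braces (so only the first statement is repeated) and that matches is assigned inside the function, which changes a local copy rather than the global variable. The helper name count_matches is hypothetical; agrepl and vapply are base R.

```r
# Sample data copied from the question above.
survey <- c("Salem", "salem, ma", "Manchester", "Manchester-By-The-Sea")
master <- c("Beverly", "Gloucester", "Manchester-by-the-Sea", "Nahant", "Salem")

# count_matches is a hypothetical helper name, not part of base R.
# For each pattern, count how many elements of x contain an approximate match.
count_matches <- function(patterns, x, max.distance = 2) {
  vapply(
    patterns,
    function(p) {
      # agrepl returns one TRUE/FALSE per element of x; summing counts the TRUEs.
      sum(agrepl(p, x, ignore.case = TRUE, max.distance = max.distance))
    },
    integer(1)
  )
}

# The counts are returned (and named by city) instead of being written
# to a global variable from inside the function.
matches <- count_matches(master, survey)
matches
```

Note that agrep and agrepl look for the pattern inside each survey string, so a shortened entry such as "Manchester" is unlikely to be counted for "Manchester-by-the-Sea" with a maximum distance of 2. Reaching the desired count of 2 for that city would probably need an extra step, such as also testing each survey entry as a pattern against the master list.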