
I have a survey dataset in which respondents described the location of their activity, usually as a town or city name. I want to identify each unique mention of the named cities and count the number of times each city was mentioned. The final output should be a vector with counts of the number of times each city was mentioned. One challenge is that city names may be misspelled, have odd capitalization, or be embedded within a longer string (which may also include more than one city). I have a master list of city names with proper capitalization and spelling which I have been trying to use as my pattern with the agrep function.

A sample chunk of the dataset is structured as follows:

survey <- c("Salem", "salem, ma","Manchester","Manchester-By-The-Sea")
master <- c("Beverly","Gloucester","Manchester-by-the-Sea","Nahant","Salem")

In this sample, the final result would be a vector:

result
[1] 0 0 2 0 2

I have been trying to construct a function using agrep that loops through the master vector, searches the survey vector for matches to each master name, and outputs the number of matches for each item of the master vector. Here is what I have so far, but all I get is NULL. I am not sure what I am doing wrong, or whether there is a better way to approach this problem.

idx <- NULL
matches <- NULL
n.match <- function(pattern, x, ...) {
for (i in 1:length(pattern))
   idx <- vector()
   idx <- agrep(pattern[i],x,ignore.case=TRUE, value=FALSE, max.distance = 2)
   matches[i] <- length(idx)
}
n.match(master,survey)
matches
Marcos

1 Answer


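The main problem is that you are missing a block {} around the body of your for loop. Without the braces, only the first statement (idx <- vector()) belongs to the loop, so you are really only initializing idx 5 times and leaving i set at 5; the agrep call and the assignment to matches then run just once, after the loop has finished. On top of that, the assignment to matches inside the function only changes a local copy, so the global matches you print afterwards is still NULL. There's no reason to keep variables needed inside your function outside of it as well. How about: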

survey <- c("Salem", "salem, ma","Manchester","Manchester-By-The-Sea")
master <- c("Beverly","Gloucester","Manchester-by-the-Sea","Nahant","Salem")

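n.match <- function(pattern, x, ...) {
    # one count per master name, kept local to the function
    matches <- numeric(length(pattern))
    for (i in seq_along(pattern)) {
       # indices of the elements of x that approximately contain pattern[i]
       idx <- agrep(pattern[i], x, ignore.case = TRUE, max.distance = 2)
       matches[i] <- length(idx)
    }
    matches   # return the counts instead of relying on a global
}
n.match(master, survey)
# [1] 0 0 1 0 2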
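
To see why the missing braces matter, here is a small standalone illustration. It is a toy loop, not code from the question; the variable out and the strings are made up for the demo. Without braces, R treats only the first statement after for (...) as the loop body, regardless of indentation.

```r
# Toy example (hypothetical, not from the question): a for loop without braces.
out <- character()
for (i in 1:3)
    out <- c(out, "in loop")     # only this statement is repeated
    out <- c(out, "after loop")  # despite the indentation, this runs once, after the loop

out  # "in loop" three times, then "after loop" once
i    # 3: the loop variable keeps its last value
```

This mirrors the original function: idx <- vector() was the only statement being looped over, and the agrep call ran a single time with i left at its final value.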

In n.match above I also played with max.distance=, which accepts a proportion of the pattern length as well as an absolute number of edits. However, it still looks like "Manchester" is too different from "Manchester-by-the-Sea" in terms of the number of deletions required to get them to match. You may want to consider down-weighting deletions.
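
As a rough way to see how far apart the two Manchester entries are, the base R function adist reports the full-string edit distance. This is only a sketch for illustration: agrep matches against substrings rather than whole strings, so the two are not identical measures, but when the survey text is shorter than the pattern the extra pattern characters still have to be dropped. The variable names master_name, manchester_entries and d are made up for the example.

```r
# Sketch: edit distance from one master name to the two Manchester survey entries.
master_name <- "Manchester-by-the-Sea"
manchester_entries <- c("Manchester", "Manchester-By-The-Sea")

# adist() returns a matrix of generalized Levenshtein distances (rows: x, columns: y)
d <- adist(master_name, manchester_entries, ignore.case = TRUE)
d[1, ]
# 11 0  -> "Manchester" is 11 edits away; the differently capitalized full name is 0

# Only entries within the allowed number of edits would count as matches
d[1, ] <= 2
# FALSE TRUE
```

With a bound of 2 edits, the bare "Manchester" entry is well out of reach, which is why the third count is 1 rather than the 2 the question hoped for.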

MrFlick