
I've been scratching my head trying to find a way to solve this problem without getting into NLP and training models. I have two rather large data sets that should be matchable by name, but the spellings and syntax differ slightly: easy enough for a human to reconcile, but tricky enough that my fuzzy matching and Levenshtein edit distances can't handle it. There are a lot of duplicates in the data sets, but still too many unique strings for me to map manually, so I am trying to create "rules" around what to match. Would a package such as FuzzyWuzzy allow more bespoke matching rules to solve this? Example below, thanks!

a <- c("The City of New York", "The City of New York", "Los Angeles City", "The State of California", "The State of California")

b <- c("New York City", "New York City", "Los Angeles", "California State", "CA State") 

The closest I've gotten so far is fuzzy string matching between the data sets, but it only works moderately well: it still misses a large chunk of the matches, and it makes quite a few false matches as I increase the maximum edit distance.

library(fuzzyjoin)
library(tibble)

a <- tibble(value = a)  # as.tibble() is deprecated; put each vector in a "value" column
b <- tibble(value = b)

# Joins on the shared "value" column, allowing up to 3 Levenshtein ("lv") edits
stringdist_inner_join(x = a, y = b, max_dist = 3, method = "lv", ignore_case = TRUE)
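
One variation worth sketching here (not something from the original post): normalise the strings before the fuzzy join, so the edit distance only has to absorb real spelling differences rather than filler words like "The", "City", "of", and "State". The normalise() helper and its stopword list below are assumptions drawn from the example vectors, and an abbreviation such as "CA" would still need an explicit rule.

library(dplyr)
library(stringr)
library(fuzzyjoin)

# Lower-case, drop filler tokens, and collapse leftover whitespace
normalise <- function(x) {
  x %>%
    str_to_lower() %>%
    str_remove_all("\\b(the|city|state|of)\\b") %>%
    str_squish()
}

a_clean <- a %>% mutate(key = normalise(value))
b_clean <- b %>% mutate(key = normalise(value))

# "The City of New York" and "New York City" both become "new york",
# so they match with zero edits; "CA State" still won't reach "california"
stringdist_inner_join(a_clean, b_clean, by = "key", max_dist = 2, method = "lv")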

The real data set has a bit more depth than this example. I was hoping to make a "rule" of some sort so that "The City of New York" always equals "New York City", but I'm not sure if there's a smarter way to go about this. I hope this specific text example helps. Thanks a bunch!
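
For what it's worth, here is what that kind of rule could look like as a hand-maintained crosswalk table, joined exactly before any fuzzy fallback. This is only a sketch: the crosswalk rows and column names are invented from the example above, and a/b are the single-column ("value") tibbles built earlier.

library(dplyr)
library(tibble)

# Hand-maintained equivalences ("rules"); these rows are illustrative only
crosswalk <- tribble(
  ~value_a,                  ~value_b,
  "The City of New York",    "New York City",
  "The State of California", "California State",
  "The State of California", "CA State"
)

# Names covered by a rule resolve with an exact join
ruled <- a %>% inner_join(crosswalk, by = c("value" = "value_a"))

# Anything left over can still fall through to stringdist_inner_join()
leftover <- a %>% anti_join(crosswalk, by = c("value" = "value_a"))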

HFguitar
  • How many cities do you have, and how many variants might there be for each city? Sometimes the best way is the simplest way: have a mapping list. – Sinh Nguyen Feb 09 '21 at 23:36
  • @SinhNguyen there are a few hundred unique strings. I'm figuring this may be the best way to go, thanks! – HFguitar Feb 10 '21 at 01:34
