Is there any way to weight specific words using the stringdist package or another string distance package?
Often I have strings that share a common word such as "city" or "university" and therefore get relatively close string distance matches, even though they are really quite different (e.g., "University of Utah" and "University of Ohio", or "XYZ City" and "ABC City").
I know that operations (delete, insert, replace) can be weighted differently depending on the algorithm, but I've not seen a way to include a list of words paired with weights. Any thoughts?
Certainly one option would be to str_remove those common words prior to matching, but that has the problem that "XYZ County" and "XYZ City" would then look identical.
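A rough sketch of that removal approach (the common_words list and the strip_common helper are just made-up illustrations, assuming the stringr and stringdist packages) shows how stripping the shared words makes "XYZ County" and "XYZ City" collide:

library(stringr)
library(stringdist)

common_words <- c("University of ", " County", " City")   # made-up list

strip_common <- function(x) {
  for (w in common_words) x <- str_remove(x, fixed(w))
  str_squish(x)
}

strip_common("XYZ County")  # "XYZ"
strip_common("XYZ City")    # "XYZ"
stringdist(strip_common("XYZ County"), strip_common("XYZ City"))  # 0 -- indistinguishable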
Example:
"University of Utah" and "University of Ohio"
stringdist("University of Utah", "University of Ohio") / max(nchar("University of Utah"), nchar("University of Ohio"))
The OSA distance is 4 and the longer string has 18 characters, so the normalized string distance is 4 / 18 = 0.22222, which is relatively low. But the normalized OSA distance between the parts that actually differ, "Utah" and "Ohio", is 4 / 4 = 1.
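Applying the same normalization to just the differentiating words makes the contrast clear:

# same normalization, but only on the words that differ
library(stringdist)
stringdist("Utah", "Ohio") / max(nchar("Utah"), nchar("Ohio"))   # 4 / 4 = 1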
However, removing "University of" and other common strings like "State" beforehand would lead to spurious matches, e.g. between "University of Ohio" and "Ohio State".
Weighting a string like "University of" to count for, say, 0.25 of the actual number of characters used in the normalization denominator would reduce the impact of those common substrings, e.g.:
4 / (18 * 0.25) = 0.888888.
It gets fuzzy, though, when we apply the same idea to the State vs. University example:
stringdist("University of Ohio", "Ohio State")
yields 16. Taking 0.25 of the denominator gives:
16 / (18 * 0.25) = 3.55555,
which is no longer a meaningful normalized distance, since it exceeds 1.
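As a sketch of that idea, here is a hypothetical helper (my own, not a stringdist feature) that scales the denominator by 0.25 whenever either string contains one of the listed phrases, which is roughly what the arithmetic above does; it reproduces both numbers and shows the second one blowing past 1:

library(stringdist)

common_phrases <- c("University of", "State", "City")   # made-up list

weighted_norm_dist <- function(a, b, phrases = common_phrases, w = 0.25) {
  denom <- max(nchar(a), nchar(b))
  # scale the denominator if either string contains a common phrase
  has_phrase <- any(sapply(phrases, function(p)
    grepl(p, a, fixed = TRUE) || grepl(p, b, fixed = TRUE)))
  if (has_phrase) denom <- denom * w
  stringdist(a, b) / denom
}

weighted_norm_dist("University of Utah", "University of Ohio")  # 4 / 4.5  = 0.889
weighted_norm_dist("University of Ohio", "Ohio State")          # 16 / 4.5 = 3.556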
Perhaps a better option would be to use LCS, but downweight substrings that match a list of common strings. So even though "University of Utah" and "University of Ohio" have a 14-character common substring, if "University of" appeared in that list, the LCS value for it would be reduced.
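A rough sketch of that, again with a made-up phrase list and helper: stringdist's "lcs" method pairs characters in order (subsequence-style, so close to but not exactly a contiguous substring), and its distance equals nchar(a) + nchar(b) minus twice the number of paired characters, so the pairing length can be backed out and the common-phrase characters counted at a reduced weight:

library(stringdist)

common_phrases <- c("University of", "State", "City")   # made-up list

lcs_sim_downweighted <- function(a, b, phrases = common_phrases, w = 0.25) {
  # stringdist's "lcs" distance = nchar(a) + nchar(b) - 2 * (paired characters)
  paired <- (nchar(a) + nchar(b) - stringdist(a, b, method = "lcs")) / 2
  # phrases present in both strings
  shared <- phrases[sapply(phrases, function(p)
    grepl(p, a, fixed = TRUE) && grepl(p, b, fixed = TRUE))]
  adj <- paired - sum(nchar(shared)) * (1 - w)   # common-phrase characters count for w
  adj / max(nchar(a), nchar(b))                  # a similarity, not a distance
}

lcs_sim_downweighted("University of Utah", "University of Ohio")  # far below the raw LCS similarity
lcs_sim_downweighted("XYZ City", "ABC City")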
Edit: Another thought
I had another thought: using the tidytext package and unnest_tokens, one can generate a list of the most common words across all the strings being matched. It might be interesting to downweight these words in proportion to how common they are in the dataset, since the more common they are, the less differentiating power they have...
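A minimal sketch of that, with made-up names and a simple inverse-frequency weight (the exact weighting formula is just a placeholder):

library(dplyr)
library(tidytext)

names_df <- tibble::tibble(
  name = c("University of Utah", "University of Ohio",
           "XYZ City", "ABC City", "Ohio State")
)

word_weights <- names_df %>%
  unnest_tokens(word, name) %>%     # one row per word, lowercased by default
  count(word, sort = TRUE) %>%
  mutate(weight = 1 / n)            # e.g. "university" and "city" get weight 0.5

word_weights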