Is there any way to weight specific words using the stringdist package or another string distance package?
Often I have strings that share a common word such as "city" or "university" and therefore get relatively close string distance matches, even though they are really quite different (e.g., "University of Utah" and "University of Ohio", or "XYZ City" and "ABC City").
I know that operations (delete, insert, replace) can be weighted differently depending on the algorithm, but I've not seen a way to include a list of words paired with weights. Any thoughts?
Certainly one option would be to str_remove those common words prior to matching, but that has the problem that "XYZ County" and "XYZ City" would then look identical.
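A rough sketch of that removal approach (the common_words list and the strip_common helper are just made-up illustrations, assuming the stringr and stringdist packages) shows how stripping the shared words makes "XYZ County" and "XYZ City" collide:

library(stringr)
library(stringdist)

common_words <- c("University of ", " County", " City")   # made-up list

strip_common <- function(x) {
  for (w in common_words) x <- str_remove(x, fixed(w))
  str_squish(x)
}

strip_common("XYZ County")  # "XYZ"
strip_common("XYZ City")    # "XYZ"
stringdist(strip_common("XYZ County"), strip_common("XYZ City"))  # 0 -- indistinguishable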
Example:
"University of Utah" and "University of Ohio"
stringdist("University of Utah", "University of Ohio") / max(nchar("University of Utah"), nchar("University of Ohio"))
The OSA distance is 4 and the longer string has 18 characters, so the normalized string distance is 4 / 18 = 0.22222, which is relatively low. But the normalized OSA distance between the parts that actually differ, "Utah" and "Ohio", is 4 / 4 = 1.
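Applying the same normalization to just the differentiating words makes the contrast clear:

# same normalization, but only on the words that differ
library(stringdist)
stringdist("Utah", "Ohio") / max(nchar("Utah"), nchar("Ohio"))   # 4 / 4 = 1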
However, removing "University of" and other common strings like "State" beforehand would lead to spurious matches, e.g. between "University of Ohio" and "Ohio State".
Weighting a string like "University of" to count for, say, 0.25 of the actual number of characters used in the normalization denominator would reduce the impact of those common substrings, e.g.:
4 / (18 * 0.25) = 0.888888.
It gets fuzzy, though, when we apply the same idea to the State vs. University example:
stringdist("University of Ohio", "Ohio State")
yields 16. Taking 0.25 of the denominator gives:
16 / (18 * 0.25) = 3.55555,
which is no longer a meaningful normalized distance, since it exceeds 1.
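As a sketch of that idea, here is a hypothetical helper (my own, not a stringdist feature) that scales the denominator by 0.25 whenever either string contains one of the listed phrases, which is roughly what the arithmetic above does; it reproduces both numbers and shows the second one blowing past 1:

library(stringdist)

common_phrases <- c("University of", "State", "City")   # made-up list

weighted_norm_dist <- function(a, b, phrases = common_phrases, w = 0.25) {
  denom <- max(nchar(a), nchar(b))
  # scale the denominator if either string contains a common phrase
  has_phrase <- any(sapply(phrases, function(p)
    grepl(p, a, fixed = TRUE) || grepl(p, b, fixed = TRUE)))
  if (has_phrase) denom <- denom * w
  stringdist(a, b) / denom
}

weighted_norm_dist("University of Utah", "University of Ohio")  # 4 / 4.5  = 0.889
weighted_norm_dist("University of Ohio", "Ohio State")          # 16 / 4.5 = 3.556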
Perhaps a better option would be to use LCS, but downweight substrings that match a list of common strings. So even though "University of Utah" and "University of Ohio" have a 14-character common substring, if "University of" appeared in that list, the LCS value for it would be reduced.
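A rough sketch of that, again with a made-up phrase list and helper: stringdist's "lcs" method pairs characters in order (subsequence-style, so close to but not exactly a contiguous substring), and its distance equals nchar(a) + nchar(b) minus twice the number of paired characters, so the pairing length can be backed out and the common-phrase characters counted at a reduced weight:

library(stringdist)

common_phrases <- c("University of", "State", "City")   # made-up list

lcs_sim_downweighted <- function(a, b, phrases = common_phrases, w = 0.25) {
  # stringdist's "lcs" distance = nchar(a) + nchar(b) - 2 * (paired characters)
  paired <- (nchar(a) + nchar(b) - stringdist(a, b, method = "lcs")) / 2
  # phrases present in both strings
  shared <- phrases[sapply(phrases, function(p)
    grepl(p, a, fixed = TRUE) && grepl(p, b, fixed = TRUE))]
  adj <- paired - sum(nchar(shared)) * (1 - w)   # common-phrase characters count for w
  adj / max(nchar(a), nchar(b))                  # a similarity, not a distance
}

lcs_sim_downweighted("University of Utah", "University of Ohio")  # far below the raw LCS similarity
lcs_sim_downweighted("XYZ City", "ABC City")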
Edit: Another thought
I had another thought: using the tidytext package and unnest_tokens, one can generate a list of the most common words across all the strings being matched. It might be interesting to downweight these words in proportion to how common they are in the dataset, since the more common they are, the less differentiating power they have...
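A minimal sketch of that, with made-up names and a simple inverse-frequency weight (the exact weighting formula is just a placeholder):

library(dplyr)
library(tidytext)

names_df <- tibble::tibble(
  name = c("University of Utah", "University of Ohio",
           "XYZ City", "ABC City", "Ohio State")
)

word_weights <- names_df %>%
  unnest_tokens(word, name) %>%     # one row per word, lowercased by default
  count(word, sort = TRUE) %>%
  mutate(weight = 1 / n)            # e.g. "university" and "city" get weight 0.5

word_weights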