Replace string with most frequent fuzzy match

Question

I have a dataframe of unstructured names, and I want to create a 'master' list of the cleaned name in one column with all the variants in another column. I am using the stringdist package. Below is a small example:

library(dplyr) # for pipes 
library(tidyr) # for expand_grid()
library(stringdist) 

words <- c("dog","dot","don","con","cry","croak","cat","dogg", "dogy", "dog", "cat", "dog")
# compare everything to everything 
words_df <- expand_grid(raw = words, clean = words) %>%
    mutate(dist = stringdist(raw, clean, method = "jw") %>% 
    # compute word frequency 
    group_by(clean) %>%
    mutate(count = n()) %>%
filter(dist < 0.3)

This results in a df with the distance and word counts for all similar enough combinations:

|raw |clean |      dist| count|
|:---|:-----|---------:|-----:|
|dog |dog   | 0.0000000|    36|
|dog |dot   | 0.2222222|    12|
|dog |don   | 0.2222222|    12|
|dog |dogg  | 0.0833333|    12|
|dog |dogy  | 0.0833333|    12|
|dog |dog   | 0.0000000|    36|
|dog |dog   | 0.0000000|    36|
|dot |dog   | 0.2222222|    36|
|dot |dot   | 0.0000000|    12|
|dot |don   | 0.2222222|    12|

You can see that in the clean column I have two entries for "dog" and "dogg", which I would want to collapse into one entry (dog) because the string "dog" appears more often.

Here is what I've tried so far:

dict <- words_df %>%
    mutate(clean_new = ifelse(dist < 0.085, words_df[which.max(words_df$count)][[1]][1], clean))

Which results in:

|raw |clean |      dist| count|clean_new |
|:---|:-----|---------:|-----:|:---------|
|dog |dog   | 0.0000000|    36|NA        |
|dog |dot   | 0.2222222|    12|dot       |
|dog |don   | 0.2222222|    12|don       |
|dog |con   | 0.4444444|    12|con       |
|dog |cry   | 1.0000000|    12|cry       |
|dog |croak | 0.4888889|    12|croak     |
|dog |cat   | 1.0000000|    24|cat       |
|dog |dogg  | 0.0833333|    12|NA        |
|dog |dogy  | 0.0833333|    12|NA        |
|dog |dog   | 0.0000000|    36|NA        |

Essentially what I want to create is a dictionary of all variants of the word, based on frequency of the closest word match.

Thanks all!

Just to point out for the future, this code will be very inefficient if you are scaling it up to a very long list of words. If there are many repeats it would calculate the distance scores of identical word pairs redundantly. A more efficient solution would be to calculate the pairwise distances on `unique(words)` and tabulate the counts of each word separately, then combine the two. — qdread, Feb 05 '20 at 19:35

score 2 · Accepted Answer · answered Feb 05 '20 at 19:22

This dplyr pipe statement will return a data frame with 9 rows, one for each of the unique elements in your original words vector. First we group_by the raw column which creates a group for each unique word, then filter by your distance threshold, then find the corresponding word in clean with the highest frequency in the original dataset. In your example all words match themselves except for the two variants of "dog."

Code

words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])

Output

# A tibble: 9 x 2
  raw   clean
  <chr> <chr>
1 cat   cat  
2 con   con  
3 croak croak
4 cry   cry  
5 dog   dog  
6 dogg  dog  
7 dogy  dog  
8 don   don  
9 dot   dot

Replace string with most frequent fuzzy match

1 Answers1

Code

Output