I have a dataframe of unstructured names, and I want to create a 'master' list of the cleaned name in one column with all the variants in another column. I am using the stringdist
package. Below is a small example:
library(dplyr) # for pipes
library(tidyr) # for expand_grid()
library(stringdist)
words <- c("dog","dot","don","con","cry","croak","cat","dogg", "dogy", "dog", "cat", "dog")
# compare everything to everything
words_df <- expand_grid(raw = words, clean = words) %>%
mutate(dist = stringdist(raw, clean, method = "jw") %>%
# compute word frequency
group_by(clean) %>%
mutate(count = n()) %>%
filter(dist < 0.3)
This results in a df with the distance and word counts for all similar enough combinations:
|raw |clean | dist| count|
|:---|:-----|---------:|-----:|
|dog |dog | 0.0000000| 36|
|dog |dot | 0.2222222| 12|
|dog |don | 0.2222222| 12|
|dog |dogg | 0.0833333| 12|
|dog |dogy | 0.0833333| 12|
|dog |dog | 0.0000000| 36|
|dog |dog | 0.0000000| 36|
|dot |dog | 0.2222222| 36|
|dot |dot | 0.0000000| 12|
|dot |don | 0.2222222| 12|
You can see that in the clean
column I have two entries for "dog" and "dogg", which I would want to collapse into one entry (dog) because the string "dog" appears more often.
Here is what I've tried so far:
dict <- words_df %>%
mutate(clean_new = ifelse(dist < 0.085, words_df[which.max(words_df$count)][[1]][1], clean))
Which results in:
|raw |clean | dist| count|clean_new |
|:---|:-----|---------:|-----:|:---------|
|dog |dog | 0.0000000| 36|NA |
|dog |dot | 0.2222222| 12|dot |
|dog |don | 0.2222222| 12|don |
|dog |con | 0.4444444| 12|con |
|dog |cry | 1.0000000| 12|cry |
|dog |croak | 0.4888889| 12|croak |
|dog |cat | 1.0000000| 24|cat |
|dog |dogg | 0.0833333| 12|NA |
|dog |dogy | 0.0833333| 12|NA |
|dog |dog | 0.0000000| 36|NA |
Essentially what I want to create is a dictionary of all variants of the word, based on frequency of the closest word match.
Thanks all!