Here are 2 ways I'd approach it, one that's strictly supervised and more manual, and another that takes a less supervised route. The package stringdist
has a bunch of different distance metrics, where "lv"
is Levenshtein. I added an additional observation "poundcake" to test with a word that's too far from the reference words.
Option 1
Get a matrix of the distances between each experimental string and one of the reference strings. This could have issues if you have 2 similar reference strings, or if an experimental word is equally close to 2 references, but it works for this simple case. Then reshape the matrix into a data frame, and count along reference words to get cluster numbers. Filter for cases where the distance is less than your threshold.
library(dplyr)
library(stringdist)
max_dist <- 3
ref <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies","browwnies")
mtx <- stringdistmatrix(ref, expr, method = "lv", useNames = "strings")
mtx
#> cak cakee cake poundcake rownies browwnies
#> cake 1 1 0 5 6 8
#> brownies 8 7 7 8 1 1
df1 <- as.data.frame(mtx) %>%
tibble::rownames_to_column("ref") %>%
tidyr::pivot_longer(-ref, names_to = "expr", values_to = "dist") %>%
mutate(clust = as.numeric(forcats::as_factor(ref))) # could also use data.table::rleid
df1 %>%
filter(dist <= max_dist)
#> # A tibble: 5 × 4
#> ref expr dist clust
#> <chr> <chr> <dbl> <dbl>
#> 1 cake cak 1 1
#> 2 cake cakee 1 1
#> 3 cake cake 0 1
#> 4 brownies rownies 1 2
#> 5 brownies browwnies 1 2
Option 2
This might work for more complex cases. I've used it for correcting the spelling of people's names, where I have an incomplete set of correct labels to work from. Combine all the words into 1 vector, get a distance matrix (this time it will be square), then create clusters from hierarchical clustering using the threshold as the height to cut the tree. You can then match the reference for each word to get labels for the clusters.
The downside here is that you have rows for reference words that weren't experimental—note for example that "brownies" was never spelled correctly in the experimental strings, but now you have that observation.
all_words <- c(ref, expr)
hc <- hclust(stringdistmatrix(all_words, method = "lv", useNames = "strings"))
df2 <- data.frame(word = c(ref, expr),
clust = cutree(hc, h = max_dist)) %>%
mutate(r = ref[clust])
df2 %>%
filter(!is.na(r))
#> word clust r
#> 1 cake 1 cake
#> 2 brownies 2 brownies
#> 3 cak 1 cake
#> 4 cakee 1 cake
#> 5 cake 1 cake
#> 6 rownies 2 brownies
#> 7 browwnies 2 brownies