0

I have a very large dataset, which looks like this.

I have two types of data frames

  1. my reference data.frame
ref=c("cake","brownies")

and my experimental data.frame

expr=c("cak","cakee","cake", "rownies","browwnies")

I want to match the ref and expr data.frames and find the levenstein distance between them. The output could look like this...

ref   expr      distance 
cake  cak         1
cake  cakee       1
cake  cake        0
cake  rownies    ...

after I have measured their levenstein distance I want to cluster any string that has distance less than 3 to one cluster and my data to maybe look like

ref        expr      distance  cluster
cake       cak         1         1
cake       cakee       1         1
cake       cake        0         1
brownies   rownies     1         2 
brownies   browwnies   1         2

any help or advice on how to move on is appreciate it. At the moment I am trying a lot of R packages to find the distance between data.frame such as

library("DescTools")

but they do not seem to work well.

camille
  • 16,432
  • 18
  • 38
  • 60
LDT
  • 2,856
  • 2
  • 15
  • 32
  • 4
    Googling for "string" and "distance" revealed an aptly-named [`stringdist`](https://cran.r-project.org/web/packages/stringdist/index.html) function that coincidentally calculates the Levenshtein distance. I'd start with that. As far as grouping them together, it might be as simple as grouping with bins of `distance`, not sure. – r2evans Dec 16 '21 at 19:54
  • Thank you Evans. Yes I have worked a lot with this package and the amatch function but no luck yet. Thank for your time and for having a look – LDT Dec 16 '21 at 20:00
  • 1
    Like you want a cluster per ref word? It would be helpful to see examples with distances outside the range you're talking about (i.e. distance >= 3) – camille Dec 16 '21 at 20:02
  • Exactly! I want to cluster together strings that have a distance less than 3 to the reference. – LDT Dec 16 '21 at 20:05
  • 1
    Can you include what you did to get the distances? I second the recommendation for stringdist, I've used it for similar projects – camille Dec 16 '21 at 20:09
  • Thank you Camile. I calculate the distance by eye. I can see how stringdist works when you compare one reference string to others but not multiple references. Anyway I guess to look deeper – LDT Dec 16 '21 at 20:14
  • 1
    FYI @r2evans and I have both removed the [tag:tidyverse] tag because even though you might want to use some of those packages (none of them are in the question), the tag has pretty specific instructions for its use: "DO NOT USE if your question relates to one or two components of the tidyverse, such as dplyr or ggplot2. Use *those* tags, and tag with `r` as well for a better response. ...Unless your question is about the entirety of the tidyverse package, its installation or its integration with your system, use tags for the packages you are actually using." – camille Dec 16 '21 at 21:35

1 Answers1

1

Here are 2 ways I'd approach it, one that's strictly supervised and more manual, and another that takes a less supervised route. The package stringdist has a bunch of different distance metrics, where "lv" is Levenshtein. I added an additional observation "poundcake" to test with a word that's too far from the reference words.

Option 1

Get a matrix of the distances between each experimental string and one of the reference strings. This could have issues if you have 2 similar reference strings, or if an experimental word is equally close to 2 references, but it works for this simple case. Then reshape the matrix into a data frame, and count along reference words to get cluster numbers. Filter for cases where the distance is less than your threshold.

library(dplyr)
library(stringdist)

max_dist <- 3

ref <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "poundcake", "rownies","browwnies")

mtx <- stringdistmatrix(ref, expr, method = "lv", useNames = "strings")

mtx
#>          cak cakee cake poundcake rownies browwnies
#> cake       1     1    0         5       6         8
#> brownies   8     7    7         8       1         1

df1 <- as.data.frame(mtx) %>%
  tibble::rownames_to_column("ref") %>%
  tidyr::pivot_longer(-ref, names_to = "expr", values_to = "dist") %>%
  mutate(clust = as.numeric(forcats::as_factor(ref))) # could also use data.table::rleid

df1 %>%
  filter(dist <= max_dist)
#> # A tibble: 5 × 4
#>   ref      expr       dist clust
#>   <chr>    <chr>     <dbl> <dbl>
#> 1 cake     cak           1     1
#> 2 cake     cakee         1     1
#> 3 cake     cake          0     1
#> 4 brownies rownies       1     2
#> 5 brownies browwnies     1     2

Option 2

This might work for more complex cases. I've used it for correcting the spelling of people's names, where I have an incomplete set of correct labels to work from. Combine all the words into 1 vector, get a distance matrix (this time it will be square), then create clusters from hierarchical clustering using the threshold as the height to cut the tree. You can then match the reference for each word to get labels for the clusters.

The downside here is that you have rows for reference words that weren't experimental—note for example that "brownies" was never spelled correctly in the experimental strings, but now you have that observation.

all_words <- c(ref, expr)
hc <- hclust(stringdistmatrix(all_words, method = "lv", useNames = "strings"))

df2 <- data.frame(word = c(ref, expr), 
                  clust = cutree(hc, h = max_dist)) %>%
  mutate(r = ref[clust])

df2 %>%
  filter(!is.na(r))
#>        word clust        r
#> 1      cake     1     cake
#> 2  brownies     2 brownies
#> 3       cak     1     cake
#> 4     cakee     1     cake
#> 5      cake     1     cake
#> 6   rownies     2 brownies
#> 7 browwnies     2 brownies
camille
  • 16,432
  • 18
  • 38
  • 60