
I've got a database with free-text fields that I want to use to filter a data.frame or tibble. With a lot of work, I could perhaps create a list of all misspellings of my search terms that currently occur in the data (see below for all the spellings I had of one term) and then simply use stringr::str_detect, as in the example code below. However, this will not be safe when more misspellings might appear in the future. If I'm willing to accept some limitations or make some assumptions (e.g. about how large the edit distance between misspellings can be, or about some other measure of difference, and that people won't use completely different terms, etc.), is there some simple solution for doing a fuzzy version of str_detect?

As far as I can see, the obvious packages like stringdist do not seem to have a function that directly does this. I guess I could write my own function that applies something like stringdist::afind or stringdist::amatch to each element of a vector and post-processes the results into a vector of TRUE/FALSE values, but I wonder whether such a function already exists somewhere (and is implemented more efficiently than I would manage).
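To make that concrete, here is a rough sketch of the kind of helper I have in mind (the name fuzzy_detect is made up, and splitting on non-alphanumeric characters is just a placeholder choice; it runs stringdist::amatch over the words of each element):

library(stringdist)

# Sketch: TRUE if any word of x is within max_dist of any of the terms
fuzzy_detect <- function(x, terms, max_dist = 2) {
  words <- strsplit(tolower(x), "[^[:alnum:]]+")
  vapply(words,
         function(w) any(!is.na(amatch(w, tolower(terms), maxDist = max_dist))),
         logical(1))
}

# intended usage would then be something like
# example_data %>% filter(fuzzy_detect(disease_phase, search_terms))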

Here's an example that illustrates how, with str_detect, I might miss a row I would want:

library(tidyverse)

search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial", 
                 "Precllinical", "Preclilnical", "Preclinica", "Preclnical", 
                 "Peclinical", "Prclinical", "Peeclinical", "Pre clinical", 
                 "Precclinical", "Preclicnial", "Precliical", "Precliinical", 
                 "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

example_data = tibble(project=c("A111", "A123", "B112", "A224", "C149"),
                      disease_phase=c("Diabetes, Preclinical", "Lipid lowering, Perlcinical", 
                                      "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
                      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")) 

# Finds only project A111, but not A123
example_data %>%
  filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse="|")))
Björn
  • Look at [`stringdist_join`](https://cran.r-project.org/web/packages/fuzzyjoin/readme/README.html) – Maël Apr 20 '22 at 13:00
  • @Maël Does that not match on the whole string rather than looking for substrings, as in the example where you have "Diabetes, Preclinical" (of course it's not always just one word, it could also be "Diabetes, Pre clinical" or various other things) and want to match this with "Preclinical"? After joining, I'd also have to discard duplicate rows in case of multiple matches (perhaps two extra steps such as `dplyr::select(-search_term) %>% distinct()`), but I guess that wouldn't be so bad (so matching on the whole string would be the main obstacle to using `stringdist_join`). – Björn Apr 20 '22 at 13:21

3 Answers

8

You can use agrepl for approximate string matching (fuzzy matching); it is in base R. (The first example below uses agrep, which returns the indices of matching elements rather than a logical vector.)

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or use Reduce over the search terms instead of combining them with | in a single regex:

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative might be adist, also in base R, which calculates a distance matrix, so it might not be recommended for larger vectors, as the matrix can get large. Here I again allow a mismatch of up to 2 characters.

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

In case only single words are compared, it might be more efficient to split disease_phase into words using strsplit, also in base R.

# Split each disease_phase into words, then record for every unique
# lowercase word the rows it appears in
. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
# Fuzzy-match each search term against the words not matched so far,
# then take the rows of all matched words
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples using agrepl:

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

How much difference is allowed can be set with max.distance (from ?agrep; a short example follows the excerpt):

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions
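For example, the list form lets you bound individual edit types. A minimal sketch, assuming the default unit costs (so list(all = 1) should behave like the integer 1 used above):

#list form of max.distance; with default costs this should be equivalent to 1
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"),
       max.distance = list(all = 1))
#[1]  TRUE  TRUE FALSE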

And also a benchmark based on @JBGruber's answer:

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       user      system     elapsed 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

Also, much time could be saved if search_terms contained only one, correctly written, variant of each word of interest.

GKi
  • Instead of controlling the cost as a number of characters, you can control it as a proportion, i.e. let's say we want at least a 90% match; then `agrepl('preclinical', search_terms, 0.1, ignore.case = TRUE, fixed = TRUE)` should do the trick. For the data given the proportion drops to around 82%, therefore one should generally look at 80% and hence use a cost of 0.2 – Onyambu Apr 27 '22 at 05:39
4

I think the most efficient/fastest way is this:

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

stringdist::stringdistmatrix calculates a distance matrix between all values in a and b. I'd never heard of Rfast::rowMins, but some googling tells me it is the fastest way to find the minimum value in each row of a matrix (apply(x, 1, min) would accomplish the same). And that is all we want: the minimum, as it tells us the smallest distance between each word in a and the words in b. We can compare this to a threshold value. Look at ?stringdist::stringdist-metrics for more info on the method argument. I simply followed @shs' suggestion, which seems plausible.

Now the second thing I would do is to tokenize the text before comparing distances, as finding misspellings in tokens makes a lot more sense. tidytext::unnest_tokens is a nice function that splits text into words (i.e., tokenization):

example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  filter(stringdist_detect(word, tolower(search_terms)))
## # A tibble: 2 × 4
##   project disease_phase               startdate   word       
##   <chr>   <chr>                       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical

Tokenisation has the extra advantage that you get a column telling you which word has been matched, which should make testing different thresholds much easier. However, as @shs suggested, you get some duplication if two misspellings are identified in the same row. You can use filter(!duplicated(project)), as in the next part, to get rid of the duplicates.
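A minimal sketch of that, adding the deduplication step to the stringdist_detect pipeline above:

example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  filter(stringdist_detect(word, tolower(search_terms))) %>% 
  # keep each project only once, even if several of its tokens matched
  filter(!duplicated(project))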

If you don’t want to define your own function, you can also follow @Maël’s suggestion. Here it is spelled out:

search_terms <- data.frame(word = search_terms)
example_data %>%
  tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
  fuzzyjoin::stringdist_inner_join(search_terms, by = "word", max_dist = 2) %>% 
  filter(!duplicated(project))
## # A tibble: 2 × 5
##   project disease_phase               startdate   word.x      word.y     
##   <chr>   <chr>                       <chr>       <chr>       <chr>      
## 1 A111    Diabetes, Preclinical       01DEC2018   preclinical preclinical
## 2 A123    Lipid lowering, Perlcinical 17-OKT-2017 perlcinical preclinical

benchmark

example_large <- example_data %>%
      # upsample for more realistic scenario
      sample_n(5000, replace = TRUE)

res <- bench::mark(
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  fuzzyjoin = {
    example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      fuzzyjoin::stringdist_inner_join(data.frame(word = search_terms), by = "word", max_dist = 2) %>% 
      filter(!duplicated(project))
  },
  agrepl = {
    example_large %>% 
      filter(agrepl(paste(search_terms, collapse = "|"), disease_phase, 2, ignore.case=TRUE, fixed=FALSE))
  },
  agrepl_reduce = {
    example_large[Reduce(\(y, x) y | agrepl(x, example_large$disease_phase, 2,
                                           ignore.case=TRUE), search_terms, FALSE),]
  },
  check = FALSE
)
summary(res)
## # A tibble: 4 × 6
##   expression             min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 stringdist_detect   21.3ms   23.3ms     42.8         NA     13.4
## 2 fuzzyjoin           57.4ms   60.1ms     16.8         NA     13.4
## 3 agrepl             224.7ms  226.4ms      4.33        NA      0  
## 4 agrepl_reduce        229ms  229.1ms      4.36        NA      0
summary(res, relative =TRUE)
## # A tibble: 4 × 6
##   expression          min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 stringdist_detect  1      1         9.88        NA      Inf
## 2 fuzzyjoin          2.70   2.59      3.88        NA      Inf
## 3 agrepl            10.6    9.73      1           NA      NaN
## 4 agrepl_reduce     10.8    9.85      1.01        NA      NaN

As you can see, stringdist_detect is the fastest, followed by fuzzyjoin (which uses stringdist under the hood as well). I also included @GKi's approach using agrepl. On smaller datasets, agrepl was actually faster, but I think you probably have more than the 5 rows of the example in your real dataset. It would not hurt to try these functions on your data and report back.

JBGruber
  • Your `string_dist()` argument `method` is not passed to anything inside the function. – shs Apr 28 '22 at 13:00
  • You are using `stringdistmatrix()` to compare against the list of potential misspellings, although the OP stated that he doesn't want an answer that relies on such an enumeration of potential misspellings. This is particularly strange, as you are emphasising performance with your solution, which will suffer if you are calculating 20 string distances per word instead of 1. Also, memory use is much greater in this instance. – shs Apr 28 '22 at 13:00
  • Unnesting into a long format with `unnest_tokens()` is a good approach because it benefits more from vectorization, which is the actual reason for the better performance. But in your solution it leads to the risk of generating duplicates. A case like `"Preclinical lipid lowering, Preclinical"` is not in the sample data, but not uncommon in human-entered data. The nested tokenization in my solution avoids this but is slower. The problem can be avoided in long data by grouping and filtering for the first minimum per group without much of a performance penalty. – shs Apr 28 '22 at 13:01
  • `example_large |> tidytext::unnest_tokens(word, disease_phase, drop = F) |> mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |> filter(str_dist < 4) |> group_by(project, disease_phase) |> slice(which.min(str_dist))` is faster and safer because it calculates only one string distance per word and removes duplicates – shs Apr 28 '22 at 13:03
  • Thanks @shs. Good points. I added notes about duplication and fixed the function (which I had already done for the benchmarks but forgot to put here). I have to admit that I misunderstood the question: I thought the initial list of misspellings was already there and should possibly be updated using the fuzzy matching. Anyway, I still think it makes sense to write the function so it can use a vector of search_terms. – JBGruber Apr 28 '22 at 15:55
3

The Damerau–Levenshtein distance is a good choice for measuring string distance when it comes to typos. In the following piece of code I split disease_phase into substrings and check whether any of them closely matches "preclinical".

library(tidyverse)
library(stringdist)

example_data |> 
  filter(str_split(disease_phase, "\\W+") |> 
           map_lgl(\(x) x |> 
                 stringdist("preclinical", "dl") |> 
                 (`<=`)(4) |> # Threshold for distance
                 any()
               )
         )
#> # A tibble: 2 × 3
#>   project disease_phase               startdate  
#>   <chr>   <chr>                       <chr>      
#> 1 A111    Diabetes, Preclinical       01DEC2018  
#> 2 A123    Lipid lowering, Perlcinical 17-OKT-2017

I chose a rather conservative threshold distance of <= 4 because, as you can see below, your typo examples all fall within it. You may want to do a bit of testing to find a good threshold (see the sketch further below).

stringdist(search_terms, "preclinical")
#>  [1] 0 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 3 2 2

Created on 2022-04-23 by the reprex package (v2.0.1)
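If you want to test thresholds systematically, you could, for example, count how many of the 20 known spellings each candidate threshold would catch:

sapply(1:4, \(k) sum(stringdist(search_terms, "preclinical", method = "dl") <= k))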

edit:

As I stated in my comments on JBGruber's answer, going long instead of nested has a significant performance benefit. So better do:

example_large |>
  tidytext::unnest_tokens(word, disease_phase, drop = F) |>
  mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
  filter(str_dist < 4) |>
  group_by(project, disease_phase) |>
  slice(which.min(str_dist))

The last two lines are there to avoid potential duplicates when "preclinical" appears twice in the same string, which it doesn't in the sample data, but which is not unlikely in a large, human-generated data set.
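For illustration, here is a made-up row (not from the question's data) where the term appears twice; the grouping and slicing then keep just one row for it:

dup_row <- tibble(project = "D001",
                  disease_phase = "Preclinical lipid lowering, Preclinical",
                  startdate = "2020-01-01")

dup_row |>
  tidytext::unnest_tokens(word, disease_phase, drop = F) |>
  mutate(str_dist = stringdist(word, "preclinical", method = "dl")) |>
  filter(str_dist < 4) |>
  group_by(project, disease_phase) |>
  slice(which.min(str_dist))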

shs