I have a database with free-text fields that I want to use to filter a `data.frame` or `tibble`. With a lot of work I could compile a list of every misspelling of my search terms that currently occurs in the data (see below for all the spellings I found of one term) and then simply use `stringr::str_detect`, as in the example code below. However, this stops being safe as soon as new misspellings appear in the future. If I'm willing to accept some limitations or make some assumptions (e.g. a maximum edit distance between a misspelling and the intended term, or some other bound on the difference, and that people won't use completely different terms), is there a simple way to do a fuzzy version of `str_detect`?
As far as I can see, the obvious packages such as `stringdist` do not seem to have a function that does this directly. I guess I could write my own function that applies something like `stringdist::afind` or `stringdist::amatch` to each element of a vector and post-processes the results to return a vector of `TRUE`/`FALSE` values, but I wonder whether such a function already exists somewhere (and is implemented more efficiently than I would manage).
Here's an example that illustrates how `str_detect` can miss a row I want:
    library(tidyverse)

    # All spellings of one term that currently occur in the data
    search_terms = c("preclinical", "Preclincal", "Preclincial", "Preclinial",
                     "Precllinical", "Preclilnical", "Preclinica", "Preclnical",
                     "Peclinical", "Prclinical", "Peeclinical", "Pre clinical",
                     "Precclinical", "Preclicnial", "Precliical", "Precliinical",
                     "Preclinal", "Preclincail", "Preclinicgal", "Priclinical")

    example_data = tibble(
      project = c("A111", "A123", "B112", "A224", "C149"),
      disease_phase = c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                        "Asthma, Phase I", "Phase II; Hypertension", "Phase 3"),
      startdate = c("01DEC2018", "17-OKT-2017", "11/15/2019", "1. Dezember 2004", "2005-11-30")
    )

    # Finds only project A111, but not A123
    example_data %>%
      filter(str_detect(tolower(disease_phase), paste0(tolower(search_terms), collapse = "|")))
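To make concrete what kind of wrapper I mean, here is a minimal sketch. It is not built on `stringdist`. It uses base R's `agrepl` instead, which does approximate substring matching with a generalized Levenshtein distance. The helper name `fuzzy_detect` and the budget of 3 edits are my own assumptions for illustration, not an established API or a recommended threshold.

```r
# Hypothetical helper: a fuzzy analogue of str_detect() built on base R's agrepl().
# `max_distance` is the maximum number of edits (insertions, deletions,
# substitutions) allowed between `pattern` and some substring of each element.
fuzzy_detect <- function(string, pattern, max_distance = 3) {
  # Lower-case both sides so the comparison is case-insensitive,
  # mirroring the tolower() calls in the str_detect() example.
  agrepl(tolower(pattern), tolower(string), max.distance = max_distance)
}

disease_phase <- c("Diabetes, Preclinical", "Lipid lowering, Perlcinical",
                   "Asthma, Phase I", "Phase II; Hypertension", "Phase 3")

# One logical value per element, usable directly as a filter condition
hits <- fuzzy_detect(disease_phase, "preclinical")
print(data.frame(disease_phase, hits))
```

Inside a pipeline this would be used as `filter(fuzzy_detect(disease_phase, "preclinical"))`. The edit budget clearly needs tuning: too small and typos with several swapped letters (like "Perlcinical") are missed, too large and unrelated text starts to match.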