2

I'm looking for something similar to str_detect() from the stringr package, but capable of detecting imperfect or "fuzzy" matches. Preferably, I'd like to be able to specify the degree of imperfection (1 different character, 2 different characters, etc.).

The matching I'm doing will look something like the code below (a simplified example I made up). In the example, only "RUTH CHRIS" gets matched. I'd like something capable of matching the slightly wrong strings as well.

library(tidyverse)

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    str_detect(restaurant, cheap) ~ "CHEAP",
    str_detect(restaurant, expensive) ~ "EXPENSIVE"
    )) 

So again, this gives this output:

# A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST NA       
# 2 NEW JERSEY WENDYS          NA       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           NA

But I want:

# A tibble: 4 × 2
#   restaurant                 category 
#   <chr>                      <chr>    
# 1 MCDOlNALD'S ON FRANKLIN ST CHEAP       
# 2 NEW JERSEY WENDYS          CHEAP       
# 3 8/25/19 RUTH CHRIS         EXPENSIVE
# 4 MELTINGPO 9823i3           EXPENSIVE

I'm not against using regex, but my actual data is significantly more complicated than the given example, so I'd prefer something much more concise that allows for general, not specific, types of fuzziness.

justing

3 Answers

4

In base R, you could do:

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

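# agrep() returns, for each pattern, the index of the row it approximately matches.
# Note: the indexing below works here because pattern k matches exactly row k.
transform(my_restaurants, category = pat[sapply(pat$values, agrep, restaurant), 2])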

                  restaurant  category
1 MCDOlNALD'S ON FRANKLIN ST     cheap
2          NEW JERSEY WENDYS     cheap
3         8/25/19 RUTH CHRIS expensive
4           MELTINGPO 9823i3 expensive
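
The one-liner above relies on each pattern matching exactly one row, in the same order as the rows of pat. A minimal sketch of a per-row variant, assuming the same pat table and the default agrepl() tolerance; the extra "PIZZA HUT" row and the names hits, first_hit and result are illustrative and not part of the original answer:

```r
# Reference patterns and their categories (same data as in the answer).
pat <- stack(list(cheap = c("MCDONALD'S", "WENDY'S"),
                  expensive = c("RUTH CHRIS", "MELTING POT")))

restaurant <- c("MCDOlNALD'S ON FRANKLIN ST", "NEW JERSEY WENDYS",
                "8/25/19 RUTH CHRIS", "MELTINGPO 9823i3",
                "PIZZA HUT")  # illustrative row that matches no pattern

# hits[i, j] is TRUE when pattern j approximately matches restaurant i.
hits <- sapply(pat$values, agrepl, x = restaurant)

# For each row, take the first matching pattern, or NA when nothing matches.
first_hit <- apply(hits, 1, function(row) which(row)[1])

result <- data.frame(restaurant,
                     category = as.character(pat$ind)[first_hit])
result
```

Because the lookup goes row by row, several rows can share a pattern and unmatched rows simply get NA.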
Onyambu
  • I'm glad you generally put a 'In Base R' approach. – Chris Jun 23 '22 at 15:38
  • @Chris there is always a simpler base R solution. The problem often comes from not knowing the functions. Most of my solutions try to show that base R alone can solve the problem neatly. – Onyambu Jun 23 '22 at 16:30
3

You can use fuzzyjoin::stringdist_left_join():

cheap <- c("MCDONALD'S", "WENDY'S") 
expensive <- c("RUTH CHRIS", "MELTING POT")

pat <- stack(list(cheap = cheap, expensive = expensive))

fuzzyjoin::stringdist_left_join(my_restaurants, pat, 
      c(restaurant='values'), max_dist=0.45, method = 'jaccard')

# A tibble: 4 x 3
  restaurant                 values      ind      
  <chr>                      <chr>       <fct>    
1 MCDOlNALD'S ON FRANKLIN ST MCDONALD'S  cheap    
2 NEW JERSEY WENDYS          WENDY'S     cheap    
3 8/25/19 RUTH CHRIS         RUTH CHRIS  expensive
4 MELTINGPO 9823i3           MELTING POT expensive
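
To get a feel for what max_dist = 0.45 means with method = 'jaccard', here is a rough base R sketch. It assumes the measure is computed on sets of single characters (q = 1, which I believe is the stringdist default); jaccard_chars() is a hypothetical helper written for illustration, not a function from fuzzyjoin or stringdist:

```r
# Jaccard distance between the sets of distinct characters of two strings:
# 1 - |intersection| / |union|. Character order and repeats are ignored.
jaccard_chars <- function(a, b) {
  set_a <- unique(strsplit(a, "")[[1]])
  set_b <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(set_a, set_b)) / length(union(set_a, set_b))
}

d1 <- jaccard_chars("MCDOlNALD'S ON FRANKLIN ST", "MCDONALD'S")
d2 <- jaccard_chars("NEW JERSEY WENDYS", "WENDY'S")
c(d1, d2)
```

Under that assumption the first pair comes out at 0.4375 and the second at 0.4, both just under the 0.45 threshold. Since the measure ignores character order, a looser threshold may start pairing unrelated strings that happen to share letters.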
Onyambu
0

The top response to this question clued me in to try agrepl(), which seems to best suit my needs for this project since it is a straightforward substitute for str_detect().

Using my example from above...

my_restaurants <- tibble(restaurant = c("MCDOlNALD'S ON FRANKLIN ST",
                                        "NEW JERSEY WENDYS",
                                        "8/25/19 RUTH CHRIS",
                                        "MELTINGPO 9823i3")
)

cheap <- c("MCDONALD'S", "WENDY'S") %>% str_c(collapse="|")
expensive <- c("RUTH CHRIS", "MELTING POT") %>% str_c(collapse="|")

my_restaurants %>%
  mutate(category = case_when(
    agrepl(cheap, restaurant, max.distance = 2, fixed = FALSE) ~ "CHEAP",
    agrepl(expensive, restaurant, max.distance = 2, fixed = FALSE) ~ "EXPENSIVE"
  ))

Gives the output:

# A tibble: 4 × 2
  restaurant                 category 
  <chr>                      <chr>    
1 MCDOlNALD'S ON FRANKLIN ST CHEAP    
2 NEW JERSEY WENDYS          CHEAP    
3 8/25/19 RUTH CHRIS         EXPENSIVE
4 MELTINGPO 9823i3           EXPENSIVE

However, Onyambu's solutions also look like good alternatives. They allow for more advanced forms of fuzzy matching than agrepl() is capable of.
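
For anyone wondering how the max.distance argument of agrepl() (the 2 in my code) controls the degree of imperfection, here is a small sketch using one string from the example. It assumes, per the agrep documentation, that a value of 1 or more is read as a whole number of edits while a value below 1 is read as a fraction of the pattern length; the variable names are just for illustration:

```r
x <- "MELTINGPO 9823i3"

# Try the same pattern with 0, 1 and 2 allowed edits
# (insertions, deletions or substitutions).
allowed <- 0:2
matched <- sapply(allowed, function(d) agrepl("MELTING POT", x, max.distance = d))

data.frame(allowed, matched)
```

The string is missing both the space and the final "T", so it should only match once two edits are allowed.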

justing