1
library(tidyverse)

data <- tibble(city =c('Montreal','Montréal','Ottawa','Ottawa','New York','Newyork','New-York'),
value = 1:7)

data%>%
group_by(city)%>%
summarise(mean = mean(value))

and I'd like to obtain something like that but unfortunately it creates 6 groups when in fact there are 3 cities. I have a far larger data set and I was wondering how we could use fuzzy string matching to find the solution. Is there a way to automate this because my data has thousands of observations...

olivroy
  • 548
  • 3
  • 13

2 Answers2

0

Yes, there is a partial solution at least. It's called agrep() and is based on the Levenshtein distance. You can use it like this.

A = c("New-York", "newyork" ,"Montreal", "montreal")

agrep("[nN]ewyork", A, ignore.case = FALSE, value = FALSE,
      max.distance = 0.5)
[1] 1 2

The parameter max.distance controls the acceptance tolerance, i.e. a higher value leads to more (also erroneous) matches and a very low value to likely only the perfect match.

Note that it is not a perfect solution, though, as it fails when the variations get crazier, e.g.

A = c("New-York", "newyork", "NEW--YORK"  ,"Montreal", "montreal")
agrep("[nN]ewyork", A, ignore.case = FALSE, value = FALSE,
      max.distance = 0.5)
[1] 1 2
Taufi
  • 1,557
  • 8
  • 14
0

My package, zoomerjoin provides the function lsh_string_group, which allows you to fuzzily group strings in the manner you describe. The package is developed for larger datasets so it comes with some hyper-parameters that can be cumbersome for smaller datasets, but it should be able to solve the issue you describe as follows:

library(tidyverse)
library(zoomerjoin)

data <- tibble(city =c('Montreal','Montréal','Ottawa','Ottawa','New York','Newyork','New-York'),
value = 1:7)

# LSH method provided by zoomerjoin - ideal for extremely large datasets
data %>%
    mutate(
           grouped_str = jaccard_string_group(
                                          city,
                                          threshold = .6,
                                          n_gram_width = 1,
                                          n_bands = 200)
           ) %>%
    group_by(grouped_str) %>%
    summarise(mean = mean(value))

# A tibble: 3 × 2
#   grouped_str  mean
#   <chr>       <dbl>
# 1 Montreal      1.5
# 2 New York      6
# 3 Ottawa        3.5