0

I have two data frames with department names similar to these ones:

d1 <- data.frame(depto=c("antioquia", "arauca", "arauca", "cauca", "popayan cauca", "guayabal cundinamarca", "cundinamarca", "cundinamarca", "fresno - tolima", "tolima", "santander", "norte santander"))
d2 <- data.frame(depto=c("Antioquia", "Arauca", "Cauca", "Cundinamarca", "Vichada", "Tolima", "Norte de Santander", "Valle del Cauca", "Santander"), id=c(1,2,3,4,5,6,7,8,9))

The variables "depto" are suppose to be the same but with some differences. I tried using stringdist to match the two data frames.

stringdist_left_join(d1, d2,  by ="depto",  distance_col = NULL)

with the following result:

1              antioquia    Antioquia  1
2                 arauca       Arauca  2
3                 arauca        Cauca  3
4                 arauca       Arauca  2
5                 arauca        Cauca  3
6                  cauca       Arauca  2
7                  cauca        Cauca  3
8          popayan cauca         <NA> NA
9  guayabal cundinamarca         <NA> NA
10          cundinamarca Cundinamarca  4
11          cundinamarca Cundinamarca  4
12       fresno - tolima         <NA> NA
13                tolima       Tolima  6
14             santander    Santander  9
15       norte santander         <NA> NA

I would like to know a way to improve this. The first problem is the departments of Cauca and Arauca are always matched as being the same. The second problem is that some departments in d1 include the municipality (e.g. "guayabal cundinamarca") but I would like the matching method to know that if there is some additional word(s) but the department name is also included in the string to match it (in this case with Cundinamarca). The third problem is that some department names have an equal word but are different (e.g. Norte de Santander and Santander) and sometimes are not matched. Thank you

user2246905
  • 1,029
  • 1
  • 12
  • 31

1 Answers1

2

If the depto columns are generally reliable (i.e. you don't have major problems around data entry like bad spelling), you can use regex_left_join from the fuzzyjoin package:

library(fuzzyjoin)
d1 <- data.frame(depto=c("antioquia", "arauca", "arauca", "cauca", "popayan cauca", "guayabal cundinamarca", "cundinamarca", "cundinamarca", "fresno - tolima", "tolima", "santander", "norte santander"))
d2 <- data.frame(depto=c("Antioquia", "Arauca", "Cauca", "Cundinamarca", "Vichada", "Tolima", "Norte de Santander", "Valle del Cauca", "Santander"), id=c(1,2,3,4,5,6,7,8,9))
fuzzyjoin::regex_left_join(d1, d2,  by ="depto", ignore_case = TRUE)

Output:

                 depto.x      depto.y id
1              antioquia    Antioquia  1
2                 arauca       Arauca  2
3                 arauca       Arauca  2
4                  cauca        Cauca  3
5          popayan cauca        Cauca  3
6  guayabal cundinamarca Cundinamarca  4
7           cundinamarca Cundinamarca  4
8           cundinamarca Cundinamarca  4
9        fresno - tolima       Tolima  6
10                tolima       Tolima  6
11             santander    Santander  9
12       norte santander    Santander  9
henryn
  • 1,163
  • 4
  • 15
  • Thank you, this is a good solution. The only problem is Norte de Satander which was matched with Santander. – user2246905 Jul 26 '21 at 17:23
  • 1
    I think it will be really hard to find a method clever enough to match `guayabal cundinamarca` with `Cundinamarca` but not `norte santander` with `Santander`. You could try tweaking the `stringdist` approach by using different methods and similarity thresholds but that may not be robust for new data/unknown values. The only other alternative I can think of is creating some deterministic rules when these kinds of values come up. – henryn Jul 27 '21 at 09:39