I have two data frames with department names similar to these ones:
d1 <- data.frame(depto=c("antioquia", "arauca", "arauca", "cauca", "popayan cauca", "guayabal cundinamarca", "cundinamarca", "cundinamarca", "fresno - tolima", "tolima", "santander", "norte santander"))
d2 <- data.frame(depto=c("Antioquia", "Arauca", "Cauca", "Cundinamarca", "Vichada", "Tolima", "Norte de Santander", "Valle del Cauca", "Santander"), id=c(1,2,3,4,5,6,7,8,9))
The variables "depto" are suppose to be the same but with some differences. I tried using stringdist to match the two data frames.
stringdist_left_join(d1, d2, by ="depto", distance_col = NULL)
with the following result:
1 antioquia Antioquia 1
2 arauca Arauca 2
3 arauca Cauca 3
4 arauca Arauca 2
5 arauca Cauca 3
6 cauca Arauca 2
7 cauca Cauca 3
8 popayan cauca <NA> NA
9 guayabal cundinamarca <NA> NA
10 cundinamarca Cundinamarca 4
11 cundinamarca Cundinamarca 4
12 fresno - tolima <NA> NA
13 tolima Tolima 6
14 santander Santander 9
15 norte santander <NA> NA
I would like to know a way to improve this. The first problem is the departments of Cauca and Arauca are always matched as being the same. The second problem is that some departments in d1 include the municipality (e.g. "guayabal cundinamarca") but I would like the matching method to know that if there is some additional word(s) but the department name is also included in the string to match it (in this case with Cundinamarca). The third problem is that some department names have an equal word but are different (e.g. Norte de Santander and Santander) and sometimes are not matched. Thank you