merge 2 data.tables on partial match

Question

I have data.table 1 like:

entidad                 municipio                   col1   
AGUASCALIENTES          Aguascalientes              5000
PUEBLA                  Jes�s Mar�a                 3000
VALLE                   Calvillo                    2000
JALISCO                 Pabell�n de Arteaga         1000

and data.table 2 like :

entidad                 municipio                   col2   
AGUASCALIENTES          Aguascalientes              16369
PUEBLA                  Jesus Maria                 1687
VALLE                   Calvillo                    916
JALISCO                 Pabellon de Arteaga         774

and I would like to perform a merge using the columns entidad and municipio like merge(dt1, dt2, by = c('entidad', 'municipio')) using a partial match on both columns or just in column municipio. However the problem is that data.table 1 has accents and the strange symbol � and thus the municipio column might differ on 3 different cases 1) by accent, 2) by symbol � or slightly different name (ej 'ciudad juarez' in dt1 and 'Juárez' in dt2). Since I am not very familiar with working with strings/character columns I was wondering if someone could help me in figuring out how to perform a merge in this case? the result that I would like to obtain is the following, from the previous example:

entidad                 municipio                   col1    col2
AGUASCALIENTES          Aguascalientes              5000    16369
PUEBLA                  Jesus Maria                 3000    1687
VALLE                   Calvillo                    2000    916
JALISCO                 Pabellon de Arteaga         1000    774

score 2 · Accepted Answer · answered Aug 24 '20 at 19:26

You could use the stringdist_join-function from the fuzzyjoin-package for that:

library(fuzzyjoin)
stringdist_join(d1, d2,
                by = c("entidad","municipio"),
                max_dist = 2,
                method = "lv")

this gives:

       entidad.x         municipio.x col1      entidad.y         municipio.y  col2
1 AGUASCALIENTES      Aguascalientes 5000 AGUASCALIENTES      Aguascalientes 16369
2         PUEBLA         Jes�s Mar�a 3000         PUEBLA         Jesus Maria  1687
3          VALLE            Calvillo 2000          VALLE            Calvillo   916
4        JALISCO Pabell�n de Arteaga 1000        JALISCO Pabellon de Arteaga   774

After that you can use dplyr::select to retain the column you need.

merge 2 data.tables on partial match

1 Answers1