Data consolidation and cleaning using fuzzy string comparisons with -matchit- command

Question

I have two data frames: dados (the data to check) and dados1 (the reference dictionary). I want to check the code of each designation in dados against the codes in dados1. The comparison has to go through the designation text: if a designation in dados is written the same as, or similarly to, a designation in dados1, it should have the same code. A designation can match more than one row of the dictionary; when that happens, I want its code compared with the row whose length in words is equal or closest.

> dados=data.frame(designacao = c("arroz","arroz agulha","arroz agulha","arroz grao medio", "arro","arroz medio","Leite pasteurizado meio gordo"),
+                  codigo = c("11111","11111","11111","11112","11111","11114","1141204"))
> dados1=data.frame(designacao = c("arroz","arroz grao medio longo","arroz grao medio", "Leite pasteurizado meio gordo"),
+                   codigo = c("11111","11113","11112","1141202"))
> dados
                     designacao  codigo
1                         arroz   11111
2                  arroz agulha   11111
3                  arroz agulha   11111
4              arroz grao medio   11112
5                          arro   11111
6                   arroz medio   11114
7 Leite pasteurizado meio gordo 1141204
> dados1
                     designacao  codigo
1                         arroz   11111
2        arroz grao medio longo   11113
3              arroz grao medio   11112
4 Leite pasteurizado meio gordo 1141202

Three possible cases:

- Only one dictionary row has the maximum number of matching words.
- Two or more rows have the maximum number of matching words, but different lengths: when this happens, take the row with the closest length in words.
- Two or more rows have the maximum number of matching words and equal lengths: when this happens, compare the code of the designation in dados with the codes on all of those rows and accept it if it equals any of them.

> dados
                     designacao  codigo                                         resultado_codigo
1                         arroz   11111                                           Codigo correto
2                  arroz agulha   11111                                           Codigo correto
3                  arroz agulha   11111                                           Codigo correto
4              arroz grao medio   11112                                           Codigo correto
5                          arro   11111                                           Codigo correto
6                   arroz medio   11114 codigo invalido, revise o nome da designacao ou o código
7 Leite pasteurizado meio gordo 1141204 codigo invalido, revise o nome da designacao ou o código
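
The three cases above can be sketched in base R roughly as follows. This is only one possible reading of the rules, not a definitive solution: it assumes that two words "match" when their edit distance (from `adist`) is at most 1, so that "arro" still matches "arroz", and that "length" means the number of words. The helper names `split_words`, `shared_words` and `check_code`, and the `max_dist` threshold, are hypothetical choices for illustration.

```r
# Example data from the question.
dados <- data.frame(
  designacao = c("arroz", "arroz agulha", "arroz agulha", "arroz grao medio",
                 "arro", "arroz medio", "Leite pasteurizado meio gordo"),
  codigo = c("11111", "11111", "11111", "11112", "11111", "11114", "1141204"),
  stringsAsFactors = FALSE
)
dados1 <- data.frame(
  designacao = c("arroz", "arroz grao medio longo", "arroz grao medio",
                 "Leite pasteurizado meio gordo"),
  codigo = c("11111", "11113", "11112", "1141202"),
  stringsAsFactors = FALSE
)

# Lower-case a designation and split it into words.
split_words <- function(x) strsplit(tolower(trimws(x)), "\\s+")[[1]]

# Number of words in `a` that have a close match among the words in `b`.
# Assumption: "close" means an edit distance of at most `max_dist`.
shared_words <- function(b, a, max_dist = 1) {
  d <- adist(a, b)  # rows: words of a, columns: words of b
  sum(apply(d, 1, min) <= max_dist)
}

check_code <- function(designacao, codigo, dict, max_dist = 1) {
  words <- split_words(designacao)
  dict_words <- lapply(dict$designacao, split_words)
  shared <- vapply(dict_words, shared_words, integer(1),
                   a = words, max_dist = max_dist)
  if (max(shared) == 0) return(NA_character_)  # nothing similar in the dictionary

  # Case 1: keep the rows with the maximum number of matching words.
  best <- which(shared == max(shared))
  # Case 2: among those, keep the rows closest in length (number of words).
  gap <- abs(lengths(dict_words[best]) - length(words))
  best <- best[gap == min(gap)]
  # Case 3: if several rows remain, accept the code if it equals any of theirs.
  if (codigo %in% dict$codigo[best]) {
    "Codigo correto"
  } else {
    "codigo invalido, revise o nome da designacao ou o código"
  }
}

dados$resultado_codigo <- mapply(check_code, dados$designacao, dados$codigo,
                                 MoreArgs = list(dict = dados1),
                                 USE.NAMES = FALSE)
print(dados)
```

Word-level matching is used here because the rules are phrased in terms of the number of words. A whole-string distance would rank candidates differently, and a looser `max_dist` would start matching unrelated short words, so the threshold is worth checking against real data.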
