How to fuzzy match by words (not letters) in R?

Question

I need to merge two datasets based on columns that contain names that don't exaclty match, sometimes because one of the columns has a missing name with respect to the other. For example, in one column I have "Martín Gallardo" and in the other I have "Martín Ricardo Gallardo". Another problem is that in some first and last name appear reversed, like "Martín Gallardo" in one and "Gallardo Martín" in the other. How can I match this using R? My first thought was to use str_split in both and assign each on one set to the one that matches more elements from the other set, but I'm not sure how to code this.

Thank you.

Edit: data looks something like this

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
full_name=c("martin gallardo", "raul gimenez"))
A
#  A tibble: 2 x 4
#   email                      name   last_name full_name
#   <chr>                      <chr>  <chr>     <chr>          
# 1 martingallardo23@gmail.com martin gallardo  martin gallardo
# 2 raulgimenez@gmail.com      raul   gimenez   raul gimenez   

B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
full_name=c("martin ricardo gallardo", "gimenez raul"), other_data=c("A", "B"))
B
# A tibble: 2 x 5
#   email                    name           last_name full_name              other_data
#   <chr>                    <chr>          <chr>     <chr>                   <chr>     
# 1 martingallardo@gmail.com martin ricardo gallardo  martin ricardo gallardo A         
# 2 raulgimenez2@gmail.com   gimenez        raul      gimenez raul            B

Could you please share a reproducible piece of your data? So that others can help you more efficiently. — Anoushiravan R, Apr 01 '21 at 15:10
@AnoushiravanR There I added an example of the data, does that work? I want to match A to B using full_name. There is also an email available but does not always match. — Martin, Apr 01 '21 at 15:24
@MartinGallardo you then want to merge 2 data sets? I mean based on those rows that match. — Anoushiravan R, Apr 01 '21 at 15:46

score 2 · Accepted Answer · answered Apr 01 '21 at 15:49

2

This is a tidyverse way to do the join. It basically finds full_name from B that has the highest number of common words with A. library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
  ) %>%
  mutate(
    id_A = row_number()
  )

B1 <- tibble(
  nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
  other_data=c("A", "B")
  ) %>%
  mutate(
    id_B = row_number()
  )


A2 <- A1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B )


left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>% ungroup() %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")

answered Apr 01 '21 at 15:49

Jakub.Novotny

2,912
2
6
21

This is very good, thank you so much. Is there a way to get only the best fit in the end? I'm getting one nombre_completo_A matched to several from nombre_completo, all of those that share at least one word. Again, thank you! – Martin Apr 01 '21 at 16:06
Actually I think I got it. Thank you! – Martin Apr 01 '21 at 16:26
You’re welcome. Feel free to upvote/mark the answer as accepted. – Jakub.Novotny Apr 01 '21 at 16:28
Nice idea, you certainly deserve an upvote dear @Jakub.Novotny. – Anoushiravan R Apr 01 '21 at 20:38

score 1 · Answer 2 · answered Apr 01 '21 at 16:33

In order for these two data sets to be matched I first created a column nombre_completo2 in a restructured form of data set A based on how nombre_completo in data set A partially match the same column in data set B. Then I merged the two data sets so that the additional columns in data set B is added to the restructured form of A. This is how I interpreted your desired output in the first place so I hope it will be useful to you:

A <- tibble(email=c("martingallardo23@gmail.com","raulgimenez@gmail.com"), 
            name=c("martin", "raul"), last_name=c("gallardo","gimenez"), 
            nombre_completo=c("martin gallardo", "raul gimenez"))


B <- tibble(email=c("martingallardo@gmail.com", "raulgimenez2@gmail.com"), 
            name=c("martin ricardo", "gimenez"), last_name=c("gallardo", "raul"), 
            nombre_completo=c("martin ricardo gallardo", "gimenez raul"), 
            other_data=c("A", "B"))

library(dplyr)
library(tidyr)
library(purrr)

A %>%
  rowwise() %>%
  mutate(nombre_completo2 = map_chr(nombre_completo, 
                                ~ B$nombre_completo
                                [str_detect(B$nombre_completo, str_sub(.x, 1L, 4L))])) %>%
  inner_join(B, by = c("nombre_completo2" = "nombre_completo")) %>%
  select(!ends_with(".y")) %>%
  rename_with(~ str_replace(., ".x", ""), ends_with(".x"))


# A tibble: 2 x 6
# Rowwise: 
  email                      name   last_name nombre_completo nombre_completo2       other_data
  <chr>                      <chr>  <chr>     <chr>           <chr>                  <chr>     
1 martingallardo23@gmail.com martin gallardo  martin gallardo martin ricardo gallar~ A         
2 raulgimenez@gmail.com      raul   gimenez   raul gimenez    gimenez raul           B

My pleasure, I would have also liked to test it on a larger data set. Please feel free to upvote or select every answer which meets your requirements. Best wishes. — Anoushiravan R, Apr 01 '21 at 19:23

How to fuzzy match by words (not letters) in R?

2 Answers2