1

I'm wondering if there's an easy way to compare columns before doing a join in dplyr. Below are two simple dataframes. I want to join based on first and last names, however there are some spelling mistakes or different formats, such as "Elizabeth Ray" vs "Elizabeth".

I would like to compare these columns before joining. Is there a way to use a match function or set operation, like intersect, etc, to look for names that don't have matches in both columns? I just want a list of the names that are different so I can manually correct them before joining.

I would like a solution based on dplyr, tidyr, and stringr.

FirstNames<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames<-c("MacDougall","Shapiro","Yamazaki","Murphy","Elizabeth Ray")
Pets<-c("Cat","Dog","Cat","Dog","Cat")
Names1<-data.frame(FirstNames,LastNames,Pets)

FirstNames2<-c("Chris","Doug","Shintaro","Bubbles","Elsa")
LastNames2<-c("MacDougal","Shapiro","Yamazaku","Murphy","Elizabeth")
Dwelling<-c("House","House","Apartment","Condo","House")
Names2<-data.frame(FirstNames2,LastNames2,Dwelling)
Mike
  • 2,017
  • 6
  • 26
  • 53
  • You may find these base R functions useful: https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html – Mekki MacAulay Jun 10 '16 at 18:10
  • Package dplyr also has built-in set functions. The bottom right corner of this cheat sheet details their usage in a straightforward manner: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf – Mekki MacAulay Jun 10 '16 at 18:12
  • The are the following functions available in base R: `union`, `intersect` and `setdiff` – Dave2e Jun 10 '16 at 18:13
  • 2
    `dplyr::anti_join` – alistaire Jun 10 '16 at 18:14
  • anti_join works best so far. Thanks! – Mike Jun 10 '16 at 18:22
  • Lol, I guess the next question is how do I create a function that will then use the results of the anti_join to replace the contents of the columns so they match. I imagine this question is far too complicated so I'll have to figure out how to phrase it within a workable example. – Mike Jun 10 '16 at 18:25

1 Answers1

0

One way of finding the non-matching names is to combine the dataframes into one, and then filter to the names that only appear once:

Names1 %>% 
add_row(Names2 %>% select(FirstNames = FirstNames2, LastNames = LastNames2)) %>% 
select(FirstNames,LastNames) %>%
filter(n() == 1, .by= c(FirstNames, LastNames)) %>%
ungroup %>%
arrange(LastNames, FirstNames)

  FirstNames LastNames    
  <chr>      <chr>        
1 Elsa       Elizabeth    
2 Elsa       Elizabeth Ray
3 Chris      MacDougal    
4 Chris      MacDougall   
5 Shintaro   Yamazaki     
6 Shintaro   Yamazaku     

To replace the bad values, so the dataframes completely match, it requires first having some way of saying what the correct spelling is meant to be (a challenging task, given the number of Mackhayelah-variations there are in the world). This could be by having a list of correct spellings, or it could be by one dataframe having the correct spelling, and the other one using a partial match to match on the correct one, then overwriting the bad value. Below is an implementation of the first approach:

dict <- list("Elizabeth Ray" = "Elizabeth", "Yamazaki" = "Yamazaku", "MacDougall" = "MacDougal")

# replace bad values
Names1 <- Names1 %>%
mutate( FirstNames = ifelse(FirstNames %in% names(dict), dict[FirstNames], FirstNames),
         LastNames = ifelse(LastNames %in% names(dict), dict[LastNames], LastNames)%>% unlist()
        )
Names2 <- Names2 %>%
          mutate( FirstNames = ifelse(FirstNames2 %in% names(dict), dict[FirstNames2], FirstNames2),
        LastNames = ifelse(LastNames2 %in% names(dict), dict[LastNames2], LastNames2)
        ) %>%
      select(FirstNames, LastNames, Dwelling)

# join
Names1 %>%
left_join(Names2, by = c("FirstNames", "LastNames"))

FirstNames LastNames Pets  Dwelling
1      Chris MacDougal  Cat     House
2       Doug   Shapiro  Dog     House
3   Shintaro  Yamazaku  Cat Apartment
4    Bubbles    Murphy  Dog     Condo
5       Elsa Elizabeth  Cat     House
Mark
  • 7,785
  • 2
  • 14
  • 34