How to merge based on a string in a column?

Question

I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")

df1 <- data.frame(year, state, name)

year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")

df2 <- data.frame(year, state, versus)

My preferred output would be the following:

year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)

df3 <- data.frame(year, state, name, versus)

I've tried variations of the following:

library(fuzzyjoin)

stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")

stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")

And they don't seem to get close to what I want.

I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.

langtang · Accepted Answer · 2022-07-14T19:58:35.603

A simple approach, which depends somewhat on the structure of df2$versus, would be this:

library(dplyr)

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))

Output:

  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA

Update/Jul 14 2022:

If name has more complicated pattern, rather than a single word (say Molly Home, Jane Doe), we need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:

Create function (f(n,v)), which takes strings n and v, extracts the whole words (wrds) from n, and then counts how many of them are found in v. Returns TRUE if this count exceeds 0

f <- function(n,v) {
  wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}

Left join the original frames, and apply f() by row

left_join(df1,df2, by=c("year","state")) %>% 
  rowwise() %>% 
  mutate(versus:=if_else(f(name, versus), versus,NA_character_))

Output:

1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA

Input:

df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997", 
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe", 
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA, 
-6L))

df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA, 
-2L))

What if I care about any whole word? I changed George to "Molly Homes, Jane Doe" and @ george v. SALLY to "Homes (v. Vista)" and they're not merging. What's going on here? — hy9fesh, Jul 14 '22 at 18:42
it definitely requires a different approach if you are trying to match a subset of the `name` (say a whole word within `name`.. Perhaps post a new question, with some example data? — langtang, Jul 14 '22 at 19:38

harre · Answer 2 · 2022-07-15T10:03:26.227

Update 15/07:

See comment. In such case one would want to check for a match in versus for each individual name in name. This could be done like this (using @langtang's 'new' data):

df1 |>
    left_join(df2, by = c("year", "state")) |>
    rowwise() |>
    mutate(versus = if_else(str_detect(tolower(versus), paste0(unlist(str_extract_all(tolower(name), "\\w+")), collapse = "|")), versus, NA_character_)) |>
    ungroup()

Output:

# A tibble: 6 × 4 
  year  state name                  versus          
  <chr> <chr> <chr>                 <chr>           
1 2002  TN    Molly Homes, Jane Doe Homes (v. Vista)
2 2002  TN    Sally                 NA              
3 1999  AL    David                 @laura v. dAvid 
4 1999  AL    Laura                 @laura v. dAvid 
5 1997  CA    John                  NA              
6 2002  TN    Kate                  NA

Old answer:

An approach could be:

library(tidyverse)

df1 |>
  left_join(df2) |>
  group_by(state) |>
  mutate(versus = if_else(str_detect(tolower(versus), tolower(name)), versus, NA_character_)) |>
  ungroup()

Output:

# A tibble: 6 × 4
  year  state name   versus           
  <chr> <chr> <chr>  <chr>            
1 2002  TN    George @ george v. SALLY
2 2002  TN    Sally  @ george v. SALLY
3 1999  AL    David  @laura v. dAvid  
4 1999  AL    Laura  @laura v. dAvid  
5 1997  CA    John   NA               
6 2002  TN    Kate   NA

What if I care about any whole word? I changed George to "Molly Homes, Jane Doe" and @ george v. SALLY to "Homes (v. Vista)" and they're not merging. What's going on here? — hy9fesh, Jul 14 '22 at 18:42

How to merge based on a string in a column?

2 Answers2

Update/Jul 14 2022:

Linked