look for common entries in two data frames

Question

df1:
 state group species
1 CA 2 cat, dog, chicken, mouse
2 CA 1 cat
3 NV 1 dog, chicken
4 NV 2 chicken
5 WA 1 chicken, rat, mouse, lion
6 WA 2 dog, cat
7 WA 3 dog, chicken
8 WA 4 cat, chicken

df2:
 state special_species
1 CA cat
2 CA chicken
3 CA mouse
4 WA cat
5 WA chicken
6 NV dog

I'm interested in determining which of the special_species from df2 are present in df1. I'd like a new dataframe that has the state, group, and the special_species.

Anticipated output:

state group special_species
CA 2 cat, chicken, mouse
CA 1 cat
NV 1 dog
NV 2 NA
WA 1 chicken
WA 2 cat
WA 3 chicken
WA 4 cat, chicken

Can you show us what you anticipate the output to be? This would help to close the loop and confirm understanding of the question you ask. — ErrorJordan, Dec 03 '19 at 17:48
just updated the post with my anticipated output - thank you! — nso921, Dec 03 '19 at 18:01

MSR · Answer 1 · 2019-12-03T19:01:16.597

This was harder than I thought it would be. I think the following works, but hopefully someone can come up with something prettier.

First we make some data to work with (please do this yourself in the future), which I include if anyone else wants a go:

library(tidyverse)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

Then the solution is:

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, on = "state") %>%
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", ")) %>%
  full_join(df1, by = c("state" = "state", "group" = "group")) %>%
  select(state, group, special_species = species.x) %>%
  arrange(state)
#> Joining, by = "state"
#> # A tibble: 8 x 3
#> # Groups:   state [3]
#>   state group special_species    
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 NV        2 <NA>               
#> 5 WA        1 chicken            
#> 6 WA        2 cat                
#> 7 WA        3 chicken            
#> 8 WA        4 cat, chicken

The code can be significantly simplified if you would accept the desired output in slightly different format, e.g. the following is correct save the NA:

df1 %>% 
  separate_rows(species) %>% 
  full_join(df2, on = "state") %>%
  filter(species == special_species) %>%
  group_by(state, group) %>%
  summarise(species = paste(special_species, collapse = ", "))
#> Joining, by = "state"
#> # A tibble: 7 x 3
#> # Groups:   state [3]
#>   state group species            
#>   <chr> <dbl> <chr>              
#> 1 CA        1 cat                
#> 2 CA        2 cat, chicken, mouse
#> 3 NV        1 dog                
#> 4 WA        1 chicken            
#> 5 WA        2 cat                
#> 6 WA        3 chicken            
#> 7 WA        4 cat, chicken

^{Created on 2019-12-03 by the reprex package (v0.3.0)}

score 1 · Answer 2 · answered Dec 03 '19 at 18:58

Here's a data.table implementation that defines a function to find matches row by row. There's probably a more efficient solution to this problem, but here's one possibility:

# Import the data.table package
library(data.table)

df1 <- data.frame(state = c("CA", "CA", "NV", "NV", "WA", "WA", "WA", "WA"), group = c(2, 1, 1, 2, 1, 2, 3, 4), species = c("cat, dog, chicken, mouse", "cat", "dog, chicken", "chicken", "chicken, rat, mouse, lion", "dog, cat", "dog, chicken", "cat, chicken"))
df2 <- data.frame(state = c("CA", "CA", "CA", "WA", "WA", "NV"), special_species = c("cat", "chicken", "mouse", "cat", "chicken", "dog"))

# Convertint to data table
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# Create a function to find matches and return the relevant species
# Steps through df1 row by row
fn_find_matches <- function(sel_row){
  # Get the relevant row information
  comp_row <- df1[sel_row]
  species <- trimws(unlist(strsplit(as.vector(comp_row$species), ",")))

  # Retrieve the relevant df2 information for the state
  comp_tbl <- df2[state == comp_row$state]
  species <- species[species %in% comp_tbl$special_species]

  # If there are no mathcing species, return NA
  if(length(species > 0)){
    comp_row$species <- paste(species, collapse = ", ")
  } else {
    comp_row$species <- NA
  }
  return(comp_row)

}
# Create a resulting table
result_table <- rbindlist(lapply(c(1:nrow(df1)), fn_find_matches))
# Convert back to data frame if desired
setDF(result_table)
setDF(df1)
setDF(df2)

Glad to see I wasn't the only one who found this non-trivial. — MSR, Dec 03 '19 at 18:59
There may be a way to use `merge` to accomplish this, but I'm just not sure off the top of my head. Good luck! — sehock, Dec 03 '19 at 19:23

score 1 · Answer 3 · answered Dec 03 '19 at 19:10

This is based on MSR's answer. We can use semi_join to simplify the code.

library(tidyverse)

df3 <- df1 %>%
  separate_rows(species) %>%
  semi_join(df2, by = c("state", "species" = "special_species")) %>%
  group_by(state, group) %>%
  summarize(species = toString(species)) %>%
  ungroup() %>%
  complete(state, group = full_seq(group, period = 1)) %>%
  semi_join(df1, by = c("state", "group"))
df3
# # A tibble: 8 x 3
#   state group species            
#   <chr> <dbl> <chr>              
# 1 CA        1 cat                
# 2 CA        2 cat, chicken, mouse
# 3 NV        1 dog                
# 4 NV        2 NA                 
# 5 WA        1 chicken            
# 6 WA        2 cat                
# 7 WA        3 chicken            
# 8 WA        4 cat, chicken

DATA

library(tidyverse)

df1 <- tribble(
  ~state, ~group, ~species,
  "CA", 2, "cat, dog, chicken, mouse",
  "CA", 1, "cat",
  "NV", 1, "dog, chicken",
  "NV", 2, "chicken",
  "WA", 1, "chicken, rat, mouse, lion",
  "WA", 2, "dog, cat",
  "WA", 3, "dog, chicken",
  "WA", 4, "cat, chicken")

df2 <- tribble(
  ~state, ~special_species,
  "CA", "cat",
  "CA", "chicken",
  "CA", "mouse",
  "WA", "cat",
  "WA", "chicken",
  "NV", "dog")

look for common entries in two data frames

3 Answers3