2

I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).

I figured out how to do this for pairs of 2 datasets:

df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))

df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))

df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))


library(dplyr)

inner_join(df1, df2)

inner_join(df1, df3)

inner_join(df2, df3)
  • Is it possible to do this for 3 datasets all at once?

The straightforward way does not seem to work:

inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected

I thought I had found a way to do this:

library(plyr)
join_all(list(df1, df2, df3), type='inner')

But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:

Joining by: id, names
Joining by: id, names
[1] id    names
<0 rows> (or 0-length row.names)

This is not correct, seeing as in the example I created:

  • Row 3 for df1 and df2 are identical (id = 3,name = peter)
  • Row 2 for df2 and df3 are identical (id = 2, name= john)

I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?

Thank you!

stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • So you're trying to find out if there are any lines that appear in two or more dataframes? Your last idea finds if there are lines that appear in all three dataframes. A simple solution would be to just join all three and seek any duplicates – iod Apr 07 '22 at 18:50

4 Answers4

2

You can do this using get_dupes from janitor package.

library(tidyverse)
library(janitor)

# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1) 
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)

# Bind dataframes
# Get duplicates
df1 %>% 
  bind_rows(df2) %>% 
  bind_rows(df3) %>% 
  get_dupes(c(id, names))

#>   id names dupe_count df_id
#> 1  2  john          2     2
#> 2  2  john          2     3
#> 3  3 peter          2     1
#> 4  3 peter          2     2
Celso Silva
  • 136
  • 5
2

Here is how you could achieve your task:

library(dplyr)

bind_rows(df1, df2, df3) %>% 
  group_by(id, names) %>% 
  filter(n()>1) %>% 
  unique()
     id names
  <dbl> <chr>
1     3 peter
2     2 john 
TarJae
  • 72,363
  • 6
  • 19
  • 66
1

Does this count?

dfall<-bind_rows(df1,df2,df3)
dfall[duplicated(dfall),]
  id names
6  3 peter
8  2  john
iod
  • 7,412
  • 2
  • 17
  • 36
1

A possible solution (in case you want a dataframe as result, just pipe bind_rows at the end):

library(dplyr)

combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2]))) 

#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#>   id names
#> 1  3 peter
#> 
#> [[2]]
#> [1] id    names
#> <0 rows> (or 0-length row.names)
#> 
#> [[3]]
#>   id names
#> 1  2  john
PaulS
  • 21,159
  • 2
  • 9
  • 26