Finding Identical Rows in Multiple Datasets

Question

I am trying to find out if 3 datasets (df1, df2, df3) have any common rows (i.e. entire row is a duplicate).

I figured out how to do this for pairs of 2 datasets:

df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"))

df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"))

df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"))


library(dplyr)

inner_join(df1, df2)

inner_join(df1, df3)

inner_join(df2, df3)

Is it possible to do this for 3 datasets all at once?

The straightforward way does not seem to work:

inner_join(df1, df2, df3)
Error in `[.data.frame`(by, c("x", "y")) : undefined columns selected

I thought I had found a way to do this:

library(plyr)
join_all(list(df1, df2, df3), type='inner')

But this is telling me that there are no common rows (i.e. same id, same name) between these 3 dataframes:

Joining by: id, names
Joining by: id, names
[1] id    names
<0 rows> (or 0-length row.names)

This is not correct, seeing as in the example I created:

Row 3 for df1 and df2 are identical (id = 3,name = peter)
Row 2 for df2 and df3 are identical (id = 2, name= john)

I am trying to find a way to determine if these 3 datasets share any common rows. Can this be done in R?

Thank you!

So you're trying to find out if there are any lines that appear in two or more dataframes? Your last idea finds if there are lines that appear in all three dataframes. A simple solution would be to just join all three and seek any duplicates — iod, Apr 07 '22 at 18:50

score 2 · Answer 1 · answered Apr 07 '22 at 19:22

You can do this using get_dupes from janitor package.

library(tidyverse)
library(janitor)

# Added a new column 'df_id' to identify the data frame
df1 = data.frame(id = c(1,2,3), names = c("john", "alex", "peter"), df_id = 1) 
df2 = data.frame(id = c(1,2,3), names = c("alex", "john", "peter"), df_id = 2)
df3 = data.frame(id = c(1,2,3), names = c("peter", "john", "tim"), df_id = 3)

# Bind dataframes
# Get duplicates
df1 %>% 
  bind_rows(df2) %>% 
  bind_rows(df3) %>% 
  get_dupes(c(id, names))

#>   id names dupe_count df_id
#> 1  2  john          2     2
#> 2  2  john          2     3
#> 3  3 peter          2     1
#> 4  3 peter          2     2

score 2 · Accepted Answer · answered Apr 07 '22 at 19:56

2

Here is how you could achieve your task:

library(dplyr)

bind_rows(df1, df2, df3) %>% 
  group_by(id, names) %>% 
  filter(n()>1) %>% 
  unique()

     id names
  <dbl> <chr>
1     3 peter
2     2 john

answered Apr 07 '22 at 19:56

TarJae

72,363
6
19
66

score 1 · Answer 3 · answered Apr 07 '22 at 18:59

1

Does this count?

dfall<-bind_rows(df1,df2,df3)
dfall[duplicated(dfall),]
  id names
6  3 peter
8  2  john

answered Apr 07 '22 at 18:59

iod

7,412
2
17
36

PaulS · Answer 4 · 2022-04-07T19:13:35.653

A possible solution (in case you want a dataframe as result, just pipe bind_rows at the end):

library(dplyr)

combn(paste0("df", 1:3), 2, simplify = F, \(x) inner_join(get(x[1]), get(x[2]))) 

#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> Joining, by = c("id", "names")
#> [[1]]
#>   id names
#> 1  3 peter
#> 
#> [[2]]
#> [1] id    names
#> <0 rows> (or 0-length row.names)
#> 
#> [[3]]
#>   id names
#> 1  2  john

Finding Identical Rows in Multiple Datasets

4 Answers4

Linked