Subset common rows from multiple data frames

Question

I have multiple dataframes like mentioned below with unique id for each row. I am trying to find common rows and make a new dataframe which is appearing at least in two dataframes.

example- row with Id=2 is appearing in all three dataframes. similarly row with Id= 3 is there in df1 and df3.

I want to make a loop which can find common rows and create a new dataframe with common rows.

df1 <- data.frame(Id=c(1,2,3,4),a=c(0,1,0,2),b=c(1,0,1,0),c=c(0,0,4,0)) 
df2 <- data.frame(Id=c(7,2,5,9),a=c(4,1,9,2),b=c(1,0,1,5),c=c(3,0,7,0))
df3 <- data.frame(Id=c(5,3,2,6),a=c(9,0,1,5),b=c(1,1,0,0),c=c(7,4,0,0)) 

> df1                   > df2 
 Id | a | b | c |         Id | a | b | c |
 ---|---|---|---|         ---|---|---|---|                  
  1 | 0 | 1 | 0 |          7 | 4 | 1 | 3 |                           
 ---|---|---|---|         ---|---|---|---|                  
  2 | 1 | 0 | 0 |          2 | 1 | 0 | 0 |
 ---|---|---|---|         ---|---|---|---|
  3 | 0 | 1 | 4 |          5 | 9 | 1 | 7 |
 ---|---|---|---|         ---|---|---|---|
  4 | 2 | 0 | 0 |          9 | 2 | 5 | 0 |

 > df3
 Id | a | b | c |
 ---|---|---|---|
  5 | 9 | 1 | 7 |
 ---|---|---|---|
  3 | 0 | 1 | 4 |
 ---|---|---|---|
  2 | 1 | 0 | 0 |
 ---|---|---|---|
  6 | 5 | 0 | 0 |

> expected_output
 Id | a | b | c |
 ---|---|---|---|
  5 | 9 | 1 | 7 |
 ---|---|---|---|
  3 | 0 | 1 | 4 |
 ---|---|---|---|
  2 | 1 | 0 | 0 |
 ---|---|---|---|

Note:- ID is unique. Also, i want to remove rows from original dataframes which are duplicated and I am using it to create new dataframe.

Can a row appear multiple times in the same table? – Frank Jul 28 '17 at 19:48 — Frank, Jul 28 '17 at 19:48

score 2 · Accepted Answer · answered Jul 28 '17 at 19:58

2

I have multiple dataframes like mentioned below with unique id for each row. I am trying to find common rows and make a new dataframe which is appearing at least in two dataframes.

Since no ID appears twice in the same table, we can tabulate the IDs and keep any found twice:

library(data.table)

DTs = lapply(list(df1,df2,df3), data.table)

Id_keep = rbindlist(lapply(DTs, `[`, j = "Id"))[, .N, by=Id][N >= 2L, Id]

DT_keep = Reduce(funion, DTs)[Id %in% Id_keep]

#    Id a b c
# 1:  2 1 0 0
# 2:  3 0 1 4
# 3:  5 9 1 7

Your data should be in an object like DTs to begin with, not a bunch of separate named objects.

How it works

To get a sense of how it works, examine intermediate objects like

list(df1,df2,df3)
lapply(DTs, `[`, j = "Id")
Reduce(funion, DTs)

Also, read the help files, like ?lapply, ?rbindlist, ?funion.

answered Jul 28 '17 at 19:58

Frank

66,179
8
96
180

but, it doesn't remove duplicated rows from original df's. – Jul 28 '17 at 20:09
@RDP I don't understand. You want to alter the input tables? I only see the single "expected_output" above and so was creating that. Or you mean that an ID *can* appear twice in input tables (... which is why I asked)? – Frank Jul 28 '17 at 20:10
Yes I want new expected output. but, also want to remove rows which were common – Jul 28 '17 at 20:11
Ah ok, if you ask and Paolo's okay with it, you could edit your question with that (since I don't see it there yet). Otherwise, you could post it as a new question. Fwiw, I think `lapply(DTs, fsetdiff, DT_keep)` does the trick..? – Frank Jul 28 '17 at 20:13
@RDP Okay. I think good etiquette is not to edit new stuff into the question unless everyone who's posted an answer is okay with it (so, including Paolo). Is `lapply(DTs, fsetdiff, DT_keep)` what you were after? If not, maybe you need to show desired output for that part of the question as well. – Frank Jul 28 '17 at 22:09
Sure. I can try and repost again with desired output – Jul 28 '17 at 22:10

Paolo · Answer 2 · 2017-07-28T19:20:08.207

1

Combine all of the data frames:

combined <- rbind(df1, df2, df3)

Extract the duplicates:

duplicate_rows <- unique(combined[duplicated(combined), ])

(duplicated(combined) gives you the row indices of duplicate rows)

edited Jul 28 '17 at 19:20

answered Jul 28 '17 at 19:15

Paolo

3,825
4
25
41

I have more than 20 dataframes. I thought of this but its not a good idea to do for more dataframes. – Jul 28 '17 at 19:17
@RDP how are the data frames stored? Are they in a `list`, similar names, etc? – Paolo Jul 28 '17 at 19:21
If you are concerned about dataframe binding efficiency, `rbindlist` function in `data.table` package does the dataframes binding very efficiently. Just try `combined <- rbindlist(list(df1, df2, df3))` – Deena Jul 28 '17 at 19:24
You can use `do.call(rbind, mget(ls(pattern = 'df[0-9]+')))` to make the initial data frame – Sotos Jul 28 '17 at 19:34

Subset common rows from multiple data frames

2 Answers2

Linked