Delete values from a dataframe that also appear in another dataframe

Question

I have two dataframes, df1 contains many values and df2 contains a few values that also appear in df1. The values in df2 are ones that I want to delete from df1.

I have tried to do this by merging, but that doesn't seem to have an option to only keep values that aren't in both.

I also tried using the code from this answer to a similar question. This seemed to work, but it produced a dataframe with less values than I was expecting i.e. df1 contained 74911 values, df2 contained 767, and after removing them, there were 74064 remaining - so 80 additional rows were deleted. I'm not sure why this happened, if I could identify the 80 rows perhaps I could figure it out.

If anyone can think of an alternative way of achieving my goal I will be very grateful!

Here are some example dataframes, they are very simple compared to the real ones:

chrom <- c(1, 2, 3, 4)
pos <- c(2, 7, 9, 14)
seq_c <- c('A', 'G', 'C', 'T')
seq_k <- c('G', 'C', 'A', 'C')
df1 <- data.frame(chrom, pos, seq_c, seq_k)

chrom <- c(1, 2)
pos <- c(2, 7)
seq_c <- c('A', 'G')
seq_k <- c('G', 'C')
df2 <- data.frame(chrom, pos, seq_c, seq_k)

Expected output would then be:

chrom <- c(3, 4)
pos <- c(9, 14)
seq_c <- c('C', 'T')
seq_k <- c('A', 'C')
df3 <- data.frame(chrom, pos, seq_c, seq_k)

Do you want to drop rows where the entire row is present in another data frame (as opposed to a single ID key value present in another data frame)? That, and whether you have duplicate values of the key variable(s), could play a role in why more rows are being dropped than expected. — Sam Firke, Aug 12 '15 at 14:41

akrun · Answer 1 · 2015-08-12T14:31:22.057

3

We can use anti_join from dplyr which will remove the rows that are common between 'df1' and 'df2', and keep only the rest.

 library(dplyr)
 anti_join(df1, df2)

edited Aug 12 '15 at 14:31

answered Aug 12 '15 at 14:25

akrun

874,273
37
540
662

1

That sounds like a cool function, but dplyr is not available for my R version (3.1.1), and bgoldst's answer worked for me. Thanks for the suggestion! – Margins Aug 12 '15 at 14:38

bgoldst · Accepted Answer · 2015-08-12T14:37:22.293

3

Here's an idea using merge() and then removing from df1 all row indexes that successfully merged with df2:

df1[-merge(cbind(df1,ri=seq_len(nrow(df1))),df2)$ri,];
##   chrom pos seq_c seq_k
## 3     3   9     C     A
## 4     4  14     T     C

And data.table has a nice elegant way of doing this:

library('data.table');
dt1 <- data.table(df1,key=names(df1));
dt2 <- data.table(df2,key=names(df2));
dt1[!dt2];
##    chrom pos seq_c seq_k
## 1:     3   9     C     A
## 2:     4  14     T     C

Note that the exclamation mark in the i argument to [.data.table() has a special meaning here; [.data.table() takes it to mean negation of the join against dt2, whereas normally it would negate each cell of the RHS data.table.

edited Aug 12 '15 at 14:37

answered Aug 12 '15 at 14:28

bgoldst

34,190
6
38
64

This worked perfectly, I got the expected 74144 rows. Thank you so much! – Margins Aug 12 '15 at 14:35
2

In the devel version of data.table, you don't need to set the keys, just use `on=` – akrun Aug 12 '15 at 14:51

score 3 · Answer 3 · answered Aug 12 '15 at 14:31

I recreated your data.frame using stringAsFactors = FALSE, in that case, you can just use setdiff from dplyr:

chrom <- c(1, 2, 3, 4)
pos <- c(2, 7, 9, 14)
seq_c <- c('A', 'G', 'C', 'T')
seq_k <- c('G', 'C', 'A', 'C')
df1 <- data.frame(chrom, pos, seq_c, seq_k, stringsAsFactors = FALSE)

chrom <- c(1, 2)
pos <- c(2, 7)
seq_c <- c('A', 'G')
seq_k <- c('G', 'C')
df2 <- data.frame(chrom, pos, seq_c, seq_k, stringsAsFactors = FALSE)

library(dplyr)
df1 %>% setdiff(df2)

Delete values from a dataframe that also appear in another dataframe

3 Answers3