0

I have two dataframes, df1 contains many values and df2 contains a few values that also appear in df1. The values in df2 are ones that I want to delete from df1.

I have tried to do this by merging, but that doesn't seem to have an option to only keep values that aren't in both.

I also tried using the code from this answer to a similar question. This seemed to work, but it produced a dataframe with less values than I was expecting i.e. df1 contained 74911 values, df2 contained 767, and after removing them, there were 74064 remaining - so 80 additional rows were deleted. I'm not sure why this happened, if I could identify the 80 rows perhaps I could figure it out.

If anyone can think of an alternative way of achieving my goal I will be very grateful!

Here are some example dataframes, they are very simple compared to the real ones:

chrom <- c(1, 2, 3, 4)
pos <- c(2, 7, 9, 14)
seq_c <- c('A', 'G', 'C', 'T')
seq_k <- c('G', 'C', 'A', 'C')
df1 <- data.frame(chrom, pos, seq_c, seq_k)

chrom <- c(1, 2)
pos <- c(2, 7)
seq_c <- c('A', 'G')
seq_k <- c('G', 'C')
df2 <- data.frame(chrom, pos, seq_c, seq_k)

Expected output would then be:

chrom <- c(3, 4)
pos <- c(9, 14)
seq_c <- c('C', 'T')
seq_k <- c('A', 'C')
df3 <- data.frame(chrom, pos, seq_c, seq_k)
Community
  • 1
  • 1
Margins
  • 89
  • 8
  • Do you want to drop rows where the entire row is present in another data frame (as opposed to a single ID key value present in another data frame)? That, and whether you have duplicate values of the key variable(s), could play a role in why more rows are being dropped than expected. – Sam Firke Aug 12 '15 at 14:41
  • Ones where the entire row is present in another dataframe. – Margins Aug 12 '15 at 14:51

3 Answers3

3

We can use anti_join from dplyr which will remove the rows that are common between 'df1' and 'df2', and keep only the rest.

 library(dplyr)
 anti_join(df1, df2)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    That sounds like a cool function, but dplyr is not available for my R version (3.1.1), and bgoldst's answer worked for me. Thanks for the suggestion! – Margins Aug 12 '15 at 14:38
3

Here's an idea using merge() and then removing from df1 all row indexes that successfully merged with df2:

df1[-merge(cbind(df1,ri=seq_len(nrow(df1))),df2)$ri,];
##   chrom pos seq_c seq_k
## 3     3   9     C     A
## 4     4  14     T     C

And data.table has a nice elegant way of doing this:

library('data.table');
dt1 <- data.table(df1,key=names(df1));
dt2 <- data.table(df2,key=names(df2));
dt1[!dt2];
##    chrom pos seq_c seq_k
## 1:     3   9     C     A
## 2:     4  14     T     C

Note that the exclamation mark in the i argument to [.data.table() has a special meaning here; [.data.table() takes it to mean negation of the join against dt2, whereas normally it would negate each cell of the RHS data.table.

bgoldst
  • 34,190
  • 6
  • 38
  • 64
3

I recreated your data.frame using stringAsFactors = FALSE, in that case, you can just use setdiff from dplyr:

chrom <- c(1, 2, 3, 4)
pos <- c(2, 7, 9, 14)
seq_c <- c('A', 'G', 'C', 'T')
seq_k <- c('G', 'C', 'A', 'C')
df1 <- data.frame(chrom, pos, seq_c, seq_k, stringsAsFactors = FALSE)

chrom <- c(1, 2)
pos <- c(2, 7)
seq_c <- c('A', 'G')
seq_k <- c('G', 'C')
df2 <- data.frame(chrom, pos, seq_c, seq_k, stringsAsFactors = FALSE)

library(dplyr)
df1 %>% setdiff(df2)
Teo
  • 67
  • 4