19

I'm trying to solve a tricky R problem that I haven't been able to solve via Googling keywords. Specifically, I'm trying to take a subset one data frame whose values don't appear in another. Here is an example:

> test
      number    fruit     ID1  ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
> test2
      number    fruit     ID1   ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"

I have two data frames, test and test2, and the goal is to select all entire rows in test2 that don't appear in test, even though some of the values may be the same.

The output I want would look like:

item1 "number1" "papayas" "22"  "33"
item2 "number3" "peaches" "441" "25"
item3 "number4" "apples"  "123" "13"

There may be an arbitrary amount of rows or columns, but in my specific case, one data frame is a direct subset of the other.

I've used the R subset(), merge() and which() functions extensively, but couldn't figure out how to use these in combination, if it's possible at all, to get what I want.

edit: Here is the R code I used to generate these two tables.

test <- data.frame(c("number1", "apples", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13))

test <- t(test)
rownames(test) = c("item1", "item2", "item3", "item4")
colnames(test) = c("number", "fruit", "ID1", "ID2")

test2 <- data.frame(data.frame(c("number1", "papayas", 22, 33), c("number2", "oranges", 13, 33),
    c("number3", "peaches", 441, 25), c("number4", "apples", 123, 13),c("number3", "peaches", 44, 25), c("number4", "apples", 12, 13)  ))

test2 <- t(test2)
rownames(test2) = c("item1", "item2", "item3", "item4", "item5", "item6")
colnames(test2) = c("number", "fruit", "ID1", "ID2")

Thanks in advance!

so13eit
  • 942
  • 3
  • 11
  • 22
  • You don't have data frames, you have matrices. Don't use `t` on a data frame. Luckily for you, `merge` was smart enough to turn your matrices back into a data frame. Unluckily for me, it wasn't smart enough to turn `id`, now of class `factor`, back into a numeric.... – Hong Ooi Jul 02 '13 at 14:35
  • Thanks for trying, Hong! My real dataset is of class data.frame and unfortunately, your code didn't work for that one either =\. The object Matches comes out to be a NULL object – so13eit Jul 02 '13 at 14:44
  • The rownames in your desired output don't match `test2`. Is that a mistake? – Matthew Plourde Jul 02 '13 at 14:44
  • Oh, yes. Please ignore those row names, that was a hypothetical output! – so13eit Jul 02 '13 at 15:03
  • I wish there was a way to negate `match_df` from `plyr` – tumultous_rooster Aug 14 '15 at 20:16

6 Answers6

17

Here's another way:

x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#        number   fruit ID1 ID2
# item1 number1 papayas  22  33
# item3 number3 peaches 441  25
# item4 number4  apples 123  13

Edit: modified to preserve row names.

Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
  • I'm trying you code on my real data and it seems to work well. The one minor problem I have with it is that it deletes all duplicates, which may not be ideal (as some records should actually have duplicates, such as if one customer purchased two of the same product). Still, it's good enough for the data I have now, thanks! – so13eit Jul 02 '13 at 15:03
  • This is a great solution to what I have been calling the "double negative" problem in subsetting. I was trying to do a one-step subset to eliminate values from a DF that had a combination of values on two cols that I did NOT want; e.g., "subset(df, HIYA != "alpha" & BYA != "beta") -- where what I wanted was everything except those cases where HIYA = alpha and BYA = beta. This solves it! – WGray Nov 11 '14 at 19:39
  • Looks like a great answer but in all honestly I can't read the code. Why do we need a sequence? – Monica Heddneck Jan 10 '17 at 02:12
  • 1
    The `!dulpicated()` returns a vector of logicals the length of x with TRUE where not duplicated. `fromLast` pushes the TRUEs to the end of the vector (since they are now the "first" occurrences) . The `seq()<=` creates a vector with `length(test2)` TRUEs followed by 'length(test)` FALSEs, which filter out all the spurious TRUEs at the end. An equivalent construct would be: `x <- rbind(test, test2)` `x[! duplicated(x) & seq(nrow(x)) > nrow(test2), ]` – verbamour Jan 19 '17 at 17:45
4

There are two ways to solve this, using data.table and sqldf

library(data.table)
test<- fread('
item number fruit ID1 ID2 
item1 "number1" "apples"  "22" "33"
item2 "number2" "oranges" "13" "33"
item3 "number3" "peaches" "44" "25"
item4 "number4" "apples"  "12" "13"
')
test2<- fread('
item number fruit ID1 ID2 
item1 "number1" "papayas" "22"  "33"
item2 "number2" "oranges" "13"  "33"
item3 "number3" "peaches" "441" "25"
item4 "number4" "apples"  "123" "13"
item5 "number3" "peaches" "44"  "25"
item6 "number4" "apples"  "12"  "13"
item7 "number1" "apples"  "22"  "33"
')

data.table approach, this enables you to select which columns you want to compare

setkey(test,item,number,fruit,ID1,ID2)
setkey(test2,item,number,fruit,ID1,ID2)
test[!test2]
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13

Sql approach

sqldf('select * from test except select * from test2')
item  number   fruit ID1 ID2
1: item1 number1  apples  22  33
2: item3 number3 peaches  44  25
3: item4 number4  apples  12  13
usct01
  • 838
  • 7
  • 18
3

The following should get you there:

rows <- unique(unlist(mapply(function(x, y) 
          sapply(setdiff(x, y), function(d) which(x==d)), test2, test1)))
test2[rows, ]

What's happening here is:

  • mapply is used to do a column-wise comparison between the two datasets.
  • It uses setdiff to find any item which are in the former but not the latter
  • which identifies which row of the former is not present.
  • unique(unlist(....)) grabs all unique rows

  • Then we use that as a filter to the former, ie test2

Results:

       number   fruit ID1 ID2
item1 number1 papayas  22  33
item3 number3 peaches 441  25
item4 number4  apples 123  13

edit:

Make sure that your test & test2 are data.frames and not matrices, since mapply iterates over each element of a matrix, but over each column of a data.frame

test  <- as.data.frame(test,  stringsAsFactors=FALSE)
test2 <- as.data.frame(test2, stringsAsFactors=FALSE)
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • Unfortunately, this only gives one line as an output, and it's not one that's repeated! I edited the Original post to include the code I used to generate the tables, maybe you can have a look yourself – so13eit Jul 02 '13 at 14:34
  • ahh... you said you had `data.frames` when in fact you had `matrices`. Please see the edit. Convert to data.frame and all is well ;) – Ricardo Saporta Jul 02 '13 at 14:41
  • All is not well... yet! As you can see by your results, they actually aren't correct. item2 is repeated in both 'test' and 'test2', and two unique rows in test2 don't show up in the results. – so13eit Jul 02 '13 at 14:47
  • My mistake, had an error in the `mapply` function. However, the data still needs to be data.frames **and** strings must NOT be factors. – Ricardo Saporta Jul 02 '13 at 15:03
3

Using the package dplyr, you can also use anti_join.

missing.species <- anti_join(test2, test, by = NULL)

It will return rows of test2 that do not have a match in test. By explicit the variables to join by. If NULL, the function will use all variables in common across test and test2.

1

Make a new row-ID column in test2, merge the data frames, and select those rows whose IDs aren't in the merged result.

test2 <- cbind(test2, id=seq_len(nrow(test2)))

matches <- merge(test1, test2)$id

test2 <- test2[-matches, ]
Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
  • This gives an error: Warning message: In Ops.factor(matches) : - not meaningful for factors I've edited the code I used into the OP – so13eit Jul 02 '13 at 14:33
1

Here's another approach, but I'm not sure how well it would scale.

test2[!apply(test2, 1, paste, collapse = "") %in% 
        apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"

This would not delete all duplicates. Compare, for example, if test2 had duplicates:

test2 <- rbind(test2, test2[1:3, ])

## Matthew's answer: Duplicates dropped
x <- rbind(test2, test)
x[! duplicated(x, fromLast=TRUE) & seq(nrow(x)) <= nrow(test2), ]
#       number    fruit     ID1   ID2 
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"

## This one: Duplicates retained
test2[!apply(test2, 1, paste, collapse = "") %in%
  apply(test, 1, paste, collapse = ""), ]
#       number    fruit     ID1   ID2 
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
# item4 "number4" "apples"  "123" "13"
# item1 "number1" "papayas" "22"  "33"
# item3 "number3" "peaches" "441" "25"
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485