Subset duplicates based on two columns

Question

My data looks like this:

I want to subset the data and extract all records that are duplicates, based on the values in both columns. I tried using `cbind` and `unique`, but they return only the unique values. I couldn't find a reverse subset function, if that would help. Thanks.

akrun · Accepted Answer · 2015-03-09T11:27:17.367

You can try

 df1[duplicated(df1)|duplicated(df1, fromLast=TRUE),]
 #    A B
 #2  1A 2
 #3  1A 2
 #5   2 4
 #6   2 4
 #7  3A 0
 #8  3A 0
 #9  4A 1
 #10 4A 1

data

 df1 <- structure(list(A = c("1", "1A", "1A", "2", "2", "2", "3A",
 "3A", 
 "4A", "4A", "5"), B = c(2L, 2L, 2L, 3L, 4L, 4L, 0L, 0L, 1L, 1L, 
 5L)), .Names = c("A", "B"), class = "data.frame", row.names = c(NA, 
 -11L))
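If the real data has more columns than the two that define a duplicate, the same idea can be restricted to those columns. The following is a minimal sketch, assuming a hypothetical third column `C` that should not take part in the comparison; `df2`, `key`, `is_dup` and `dups` are illustrative names, not from the original post.

```r
# Same sample values as df1 above, plus a hypothetical extra column C
df2 <- data.frame(
  A = c("1", "1A", "1A", "2", "2", "2", "3A", "3A", "4A", "4A", "5"),
  B = c(2L, 2L, 2L, 3L, 4L, 4L, 0L, 0L, 1L, 1L, 5L),
  C = seq_len(11),
  stringsAsFactors = FALSE
)

# Only A and B define a duplicate; C is ignored in the comparison
key <- df2[c("A", "B")]

# duplicated() flags every occurrence after the first;
# fromLast = TRUE flags every occurrence before the last.
# Their union flags all members of each duplicated group.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Note the trailing comma: select rows, keep all columns
dups <- df2[is_dup, ]
```

Here `dups` should contain the same rows (2, 3, 5, 6, 7, 8, 9, 10) as the output shown above, with column `C` carried along.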

edited Mar 09 '15 at 11:27

answered Mar 09 '15 at 11:21

It returns an error: Error in `[.data.frame`(b, duplicated(b) | duplicated(b, fromLast = T)) : undefined columns selected – Litwos Mar 09 '15 at 11:25
@Litwos Based on the `dput` output in my post, it is not giving any errors. Please copy/paste the dput output and see if the error persists. – akrun Mar 09 '15 at 11:28
It worked, but I transformed the column to factor (as.factor). Is that necessary? I will now try on all my data. – Litwos Mar 09 '15 at 11:31
@Litwos It is not necessary. I wouldn't work with factors unless they are needed for a specific purpose. If you look at `str(df1)`, these are non-factor columns. One problem with a factor column is that after you subset, you may need to drop the unused levels, i.e. `droplevels(df1[duplicated(...)` – akrun Mar 09 '15 at 11:34
Understood. Thx a lot for your help. I will now try to find a function to count the number of duplicates in a new column, but that's for another thread. :) – Litwos Mar 09 '15 at 11:41
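For later readers: the call quoted in the error message has no comma before the closing bracket, which is most likely what triggered "undefined columns selected". The sketch below reproduces that on the sample data and also shows the unused-levels point from the comments. The names `is_dup`, `res`, `dups` and `dups_f` are illustrative only.

```r
df1 <- data.frame(
  A = c("1", "1A", "1A", "2", "2", "2", "3A", "3A", "4A", "4A", "5"),
  B = c(2L, 2L, 2L, 3L, 4L, 4L, 0L, 0L, 1L, 1L, 5L),
  stringsAsFactors = FALSE
)
is_dup <- duplicated(df1) | duplicated(df1, fromLast = TRUE)

# Without the comma, the logical vector is used to select columns.
# df1 has only two columns, so the call fails; capture the message.
res <- tryCatch(df1[is_dup], error = function(e) conditionMessage(e))

# With the comma, the logical vector selects rows.
dups <- df1[is_dup, ]

# If A is a factor, the subset keeps levels that no longer occur in it
df1$A <- factor(df1$A)
dups_f <- df1[is_dup, ]
levels(dups_f$A)              # still lists "1" and "5"
levels(droplevels(dups_f)$A)  # only the levels present in the subset
```

So factors are not required for `duplicated()` to work; they only add the extra `droplevels()` step after subsetting.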
