How to subset by distinct rows in a data frame or matrix?

Question

Suppose I had the following matrix:

mat <- matrix(c(1,1,2,1,2,3,2,1,3,2,2,1), ncol=3)
mat

Result:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    3    2
[3,]    2    2    2
[4,]    1    1    1

How can I filter/subset this matrix by whether or not each row has duplicate values? For example, in this case, I would only want to keep row 1 and row 2.

Any thoughts would be much appreciated!

score 4 · Accepted Answer · answered Jun 18 '15 at 23:43

Try this (I suspect it will be faster than any `apply` approach):

mat[rowSums(mat == mat[, 1]) != ncol(mat), ]
# ---with your object---
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    3    2

IRTFM

Yep - 0.03 seconds for a 1M row, 4 col matrix over here. Impressive. – thelatemail Jun 18 '15 at 23:46
Vectorized functions like `rowSums` and `==` beat `apply`/loops every time. – IRTFM Jun 18 '15 at 23:49
Just realised this returns a positive where there is for instance `c(1,2,1)` in a row. – thelatemail Jun 19 '15 at 00:02
This is great! I knew there was a faster way – Pierre L Jun 19 '15 at 00:02
@thelatemail: Yes. That was how it was intended. That's what I understood the request to be, but changing the test could make it return a different set of rows. – IRTFM Jun 19 '15 at 00:16
@BondedDust thanks! this approach works wonderfully. I have modified it to this `mat[rowSums(mat == mat[,1])==1 & rowSums(mat == mat[,2])==1, ]` so that it gives me rows with all distinct values – eyio Jun 19 '15 at 00:17
After testing, this code only works for thin data sets. You would have to repeat the indexing ncol minus one times. I take my vote away! :) – Pierre L Jun 19 '15 at 00:55
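
The comments above note that comparing against a single column has to be repeated once per additional column. One way to generalise this while staying vectorized is to compare every pair of columns at once. This is a minimal sketch, not part of any answer in this thread; the names `pairs`, `n_equal` and `distinct_rows` are made up for illustration, and it assumes a matrix with at least two columns.

```r
# Example matrix from the question.
mat <- matrix(c(1,1,2,1,2,3,2,1,3,2,2,1), ncol = 3)

# All unordered column pairs; each column of `pairs` is one pair (i, j).
pairs <- combn(ncol(mat), 2)

# For each row, count how many column pairs hold equal values.
# drop = FALSE keeps a matrix even when there is only one pair.
n_equal <- rowSums(mat[, pairs[1, ], drop = FALSE] ==
                   mat[, pairs[2, ], drop = FALSE])

# Keep only the rows in which no two columns match.
distinct_rows <- mat[n_equal == 0, , drop = FALSE]
distinct_rows
```

The number of pairs grows as `ncol * (ncol - 1) / 2`, so for wide matrices the `apply`-based answers may be simpler and use less memory.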

Pierre L · Answer 2 · 2015-06-18T23:32:39.780

indx <- apply(mat, 1, function(x) !any(duplicated(x)))
mat[indx, ]
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    1    3    2

This second one is just for fun. It works because a row has no duplicates exactly when it has as many unique values as it has elements.

indx2 <- apply(mat, 1, function(x) length(unique(x)) == length(x))
mat[indx2, ]
#     [,1] [,2] [,3]
#[1,]    1    2    3
#[2,]    1    3    2
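
To see why both tests agree, it can help to run them on single rows. The sketch below is only an illustration; the helper names `test_any` and `test_len` are hypothetical and simply wrap the two anonymous functions used above.

```r
x <- c(1, 2, 1)   # a row with a repeated value
y <- c(1, 3, 2)   # a row with all distinct values

duplicated(x)     # FALSE FALSE TRUE: the second 1 is flagged
unique(x)         # 1 2: shorter than x, so x contains a repeat

# The two row-level tests, written as named helpers.
test_any <- function(r) !any(duplicated(r))
test_len <- function(r) length(unique(r)) == length(r)

test_any(x); test_len(x)   # both FALSE
test_any(y); test_len(y)   # both TRUE
```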

edited Jun 18 '15 at 23:32

answered Jun 18 '15 at 23:25

the second approach is interesting, thanks for sharing! – eyio Jun 19 '15 at 00:19

score 2 · Answer 3 · answered Jun 18 '15 at 23:36

Here is my approach. It is a little shorter and uses the `anyDuplicated` function, which should be faster.

mat[!apply(mat, 1, anyDuplicated), ]
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    3    2
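
The negation works because `anyDuplicated` returns `0` when a vector has no duplicates and the index of the first duplicated element otherwise, and `!` treats `0` as `FALSE` and any non-zero number as `TRUE`. A small sketch that spells this out; the variable name `keep` is only for illustration.

```r
mat <- matrix(c(1,1,2,1,2,3,2,1,3,2,2,1), ncol = 3)

anyDuplicated(c(1, 3, 2))    # 0: no duplicates
anyDuplicated(c(1, 2, 1))    # 3: position of the first repeated value

# Equivalent to !apply(mat, 1, anyDuplicated), with the comparison made explicit.
keep <- apply(mat, 1, anyDuplicated) == 0
mat[keep, , drop = FALSE]
```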

SabDeM
