4

I've been thinking about this problem all night. Here is my matrix:

'a' '#' 3
'#' 'a' 3
 0  'I am' 2
'I am' 0 2

.....

I want to treat rows such as the first two as identical, because they differ only in the order of 'a' and '#', and I want to delete such duplicate rows. In this toy example it is easy to see that the first two rows are the same, and that the third and fourth are the same, but in my data set I don't know where the 'same' rows are.

I'm writing in R. Thanks.

Henrik
Jiang Du
  • 1
    Do you want to remove both duplicates or just one? – CCurtis Apr 10 '14 at 06:50
  • 1
    What output you want to get? `F T F T` or `T T T T` ? (`F`-not dup, `T`-dup) – bartektartanus Apr 10 '14 at 07:15
  • I think this is pretty close, but I'm getting an error. Strange, because it works if you specify i and n manually, but it goes wrong when I let repeat and for control them. It's supposed to set all repeated rows to NA, so you can then just remove them: `for(i in 1:length(df[,1])){x=(1:length(df[,1])) x=x[!x==i] for(n in x){if(sort(df[i,])[1]==sort(df[n,])[1]&sort(df[i,])[2]==sort(df[n,])[2]&sort(df[i,])[3]==sort(df[n,])[3]) df[n,1:3] <- NA} }` – CCurtis Apr 10 '14 at 08:05
  • The output I want is F T F T or T F T F, so that I can use the indicator to pick out the rows. – Jiang Du Apr 10 '14 at 14:37

3 Answers

5

Perhaps something like this would work for you. It is not clear what your desired output is though.

x <- structure(c("a", "#", "0", "I am", "#", "a", "I am", "0", "3", 
                 "3", "2", "2"), .Dim = c(4L, 3L))
x
#      [,1]   [,2]   [,3]
# [1,] "a"    "#"    "3" 
# [2,] "#"    "a"    "3" 
# [3,] "0"    "I am" "2" 
# [4,] "I am" "0"    "2" 


duplicated(
  lapply(1:nrow(x), function(y){
    A <- x[y, ]
    A[order(A)]
  }))
# [1] FALSE  TRUE FALSE  TRUE

This basically splits the matrix up by row, then sorts each row. `duplicated` works on lists too, so you just wrap the whole thing in `duplicated` to find which items (rows) are duplicated.
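To go from the logical vector to the de-duplicated matrix, here is a minimal sketch. It assumes `x` is the character matrix from above with no column names, and that you want to keep the first row of each group; `keys`, `dup` and `x_unique` are names made up for this illustration.

```r
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"), nrow = 4)

# Sort each row so that the order of elements within a row no longer matters
keys <- lapply(seq_len(nrow(x)), function(i) sort(x[i, ]))

# TRUE for every row that repeats an earlier row
dup <- duplicated(keys)

# Keep only the first occurrence of each row; drop = FALSE keeps a matrix
# even if a single row is left
x_unique <- x[!dup, , drop = FALSE]
x_unique
```

If you would rather keep the last occurrence of each group, `duplicated(keys, fromLast = TRUE)` flags the earlier rows instead.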

A5C1D2H2I1M1N2O1R2T1
  • Thanks for your help. But something goes wrong with my test data: `x=matrix(c(0,3,2,3,0,1,2,1,0),3,3)`; `z=as.vector(x)`; `ind=z>=1`; `y=c('a','b','c')`; `yy=expand.grid(y,y)`; `yyy=cbind(yy,z)[ind,]`; `duplicated(lapply(1:nrow(yyy), function(y){ A <- yyy[y, ]; A[order(A)] }))` returns `[1] FALSE FALSE FALSE FALSE FALSE FALSE`. I don't know how to keep the code from showing on a single line. Sorry. – Jiang Du Apr 10 '14 at 14:49
  • This happened to me as well! The likely reason is that you have column names assigned in x. What happens is this: `A[order(A)]` sorts the row neatly but returns it with the column names still attached. The list produced by `lapply` keeps those names and hands them over to `duplicated`, so rows that match after sorting still carry differently ordered names and are apparently not treated as duplicates. See my answer for a solution. – agoldev Jan 26 '17 at 14:07
5

For me, the approach from the other answer also produced just a vector of FALSE, meaning that it detected no duplicates. I think this is what happened: I had column names assigned in x. `A[order(A)]` sorts each row neatly, but it returns the sorted row with the column names still attached. The list built by lapply keeps those names and hands them over to duplicated(), so two rows that hold the same values after sorting still differ in the order of their names and are not reported as duplicates.

Inspired by the answer of @A Handcart And Mohair, I used the following, which worked for me:

duplicated(t(apply(x, 1, sort)))

It is also shorter ;)

Note that the example by @A Handcart And Mohair works with his sample data. But if you have named columns, it fails.
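To make the named-column case concrete, here is a rough sketch. The column names `v1`, `v2`, `v3` and the variable names are made up for illustration, and `unname()` is an extra workaround of mine rather than something taken from the other answer.

```r
# Hypothetical named version of the sample matrix (column names are made up)
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"), nrow = 4,
            dimnames = list(NULL, c("v1", "v2", "v3")))

# Names travel with the values, so the sorted rows carry differently ordered
# names; in this thread that was reported to give FALSE for every row
with_names <- duplicated(lapply(seq_len(nrow(x)), function(i) sort(x[i, ])))

# Workaround 1: strip the names before sorting each row
no_names <- duplicated(lapply(seq_len(nrow(x)),
                              function(i) sort(unname(x[i, ]))))

# Workaround 2: sort row-wise and compare the rows of the resulting matrix
as_matrix <- duplicated(t(apply(x, 1, sort)))

no_names
as_matrix
```

Both workarounds compare only the values, so either should be usable as the row indicator, for example `x[!as_matrix, , drop = FALSE]`.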

agoldev
1

As a start, you might want to refer to the documentation for the base R function duplicated(). As the documentation notes, "duplicated() determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates." Some of the examples it provides are:

Example 1:

duplicated(iris)[140:143]

Example 2:

duplicated(iris3, MARGIN = c(1, 3))

Example 3:

anyDuplicated(iris)

Example 4:

anyDuplicated(x)

Example 5:

anyDuplicated(x, fromLast = TRUE)

EDIT: If you wanted to do it the long way, you might think of comparing every row to every other row in the data frame, character by character. To do this, imagine that the first row has 3 characters. For each other row, you loop through and check whether it contains the first of these characters. If it does, you reduce the comparison to the remaining characters and check the next one. A self-written recursive function that compares the values in one row to all other rows in the data frame or matrix (and then subsets ONLY on rows that do not match any other row) could work.
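A simplified sketch of the "long way" is below. Note that it swaps the recursive character-by-character check for a plain double loop that sorts two rows and compares them with `identical()`, and it only flags rows that repeat an earlier row. The matrix `x` and the name `is_dup` are assumptions for illustration.

```r
x <- matrix(c("a", "#", "0", "I am",
              "#", "a", "I am", "0",
              "3", "3", "2", "2"), nrow = 4)

n <- nrow(x)
is_dup <- logical(n)               # starts as all FALSE

for (i in seq_len(n)[-1]) {        # rows 2..n
  for (j in seq_len(i - 1)) {      # every earlier row
    # Two rows match if they hold the same values once order is ignored
    if (identical(sort(x[i, ]), sort(x[j, ]))) {
      is_dup[i] <- TRUE
      break                        # one match is enough
    }
  }
}

is_dup
```

This does on the order of n^2 comparisons, so for large data the `duplicated()`-based approaches are likely the better choice.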

Nathaniel Payne