I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I want to do, which is to remove/collapse duplicates from a dataset, but I do not understand the last line. I would like to know on what basis the duplicates are removed or collapsed. It was commented that this is based on the median absolute deviation (MAD), but I do not follow that. Could anyone help me understand this, please?

 Probesets=paste("a",1:200,sep="")
 Genes=sample(letters,200,replace=T)
 Value=rnorm(200)
 X=data.frame(Probesets,Genes,Value)
 X=X[order(X$Value,decreasing=T),]
 Y=X[which(!duplicated(X$Genes)),]
Sylvia Rodriguez
    The duplicates are sorted in such way that the maximum of each gene is left; try `all.equal(sort(Y[, "Value"]), as.numeric(sort(with(X, tapply(Value, Genes, max)))))` which yields `TRUE`. BTW, you may omit the `which` in your code and write just `X[!duplicated(X$Genes),]`. – jay.sf Apr 16 '20 at 11:32
    Thank you for clarifying and for your suggestion! :) – Sylvia Rodriguez Apr 16 '20 at 12:42

2 Answers

Are you sure you want to remove those rows where the `Genes` values are duplicated? That's at least what this code does:

Y=X[which(!duplicated(X$Genes)),]

Thus, `Y` contains only unique `Genes` values. If you compare `nrow(Y)` and `length(unique(X$Genes))`, you will see that the result is the same:

nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26

If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:

Y=X[!duplicated(X),]

To see how it works consider this example:

df <- data.frame(
  a = c(1,1,2,3),
  b = c(1,1,3,4)
)
df
  a b
1 1 1
2 1 1
3 2 3
4 3 4

df[!duplicated(df),]
  a b
1 1 1
3 2 3
4 3 4
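
To make the mechanism visible, here is a minimal sketch on the same toy `df` that prints the logical vector `duplicated()` returns. The variable names `dup_rows`, `dup_a`, and `dup_last` are illustrative only. It also shows the `fromLast` argument, which flags earlier occurrences instead of later ones.

```r
# Same toy data frame as above
df <- data.frame(
  a = c(1, 1, 2, 3),
  b = c(1, 1, 3, 4)
)

# Whole-row comparison: only the second row repeats an earlier row
dup_rows <- duplicated(df)
print(dup_rows)

# Single-column comparison: only the second value of `a` repeats an earlier one
dup_a <- duplicated(df$a)
print(dup_a)

# fromLast = TRUE scans from the bottom, so the first of the two
# identical rows is the one flagged as a duplicate
dup_last <- duplicated(df, fromLast = TRUE)
print(dup_last)

# Keeping the non-flagged rows therefore keeps the last occurrence
kept_last <- df[!dup_last, ]
print(kept_last)
```

Because `duplicated()` always keeps the first occurrence it encounters (or the last, with `fromLast = TRUE`), the row order before the call decides which of several duplicates survives.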
Chris Ruehlemann
  • Thank you very much for your detailed answer. Yes, my intention is to keep only one row per unique Gene. You are totally right that this will remove rows that do not contain duplicate values across all columns. Strictly speaking, you are completely correct, but in this case that is the intention. Thank you. – Sylvia Rodriguez Apr 16 '20 at 12:45

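Your code keeps the record with the maximum `Value` per gene. `order(X$Value,decreasing=T)` puts the largest values first, and `duplicated()` flags every occurrence of a gene after its first, so `!duplicated(X$Genes)` retains only the top-ranked row for each gene. No median absolute deviation is computed in this snippet; the selection depends only on what `Value` holds, so it would be MAD-based only if `Value` itself were a MAD.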
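
As a check on the statement above, here is a minimal sketch that rebuilds the question's data and compares the kept rows against a per-gene maximum computed independently with `tapply()`. The `set.seed(1)` call and the names `X_sorted`, `max_per_gene`, and `expected` are assumptions added for reproducibility and illustration; they are not part of the original code.

```r
set.seed(1)  # arbitrary seed, added only so the example is reproducible

# Same construction as in the question
X <- data.frame(
  Probesets = paste0("a", 1:200),
  Genes     = sample(letters, 200, replace = TRUE),
  Value     = rnorm(200)
)

# Sort so the largest Value comes first, then keep the first row per gene
X_sorted <- X[order(X$Value, decreasing = TRUE), ]
Y <- X_sorted[!duplicated(X_sorted$Genes), ]

# Independent computation of the maximum Value for each gene
max_per_gene <- tapply(X$Value, X$Genes, max)

# Look up each kept gene's maximum by name and compare with the kept Value
expected <- as.numeric(max_per_gene[as.character(Y$Genes)])
print(isTRUE(all.equal(Y$Value, expected)))

# One row per distinct gene remains
print(nrow(Y) == length(unique(X$Genes)))
```

Sorting with `decreasing = FALSE` instead would keep the minimum per gene, since the first occurrence would then be the smallest value.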

hello_friend