Function for sampling between duplicated values in data.frame

Question

So, I have a data.frame object called "DATA". This object contains one column called "Point" (DATA$Point). Since there are some duplicates in this particular column, I would like to build a function that samples only one row among these duplicates in DATA.

I've been trying to do it this way:

sort.song<-function(DATA){

 Point<-levels(DATA$Point)
 DATA.NEW<-DATA[1:length(Point),] 

#Ideally DATA.NEW should have an empty dataframe with nrow=length(Point) and the same columns
#as in DATA. But I THINK it will work (I don't know how to do the "ideally" way)

 for(i in 1:dim(DATA)[1]){ #dim(DATA)[1] always bigger than length(Point)
  SUBDATA<-DATA[which(DATA$Point%in%Point[i]),]

#I need to sample one row of the original data set only of the duplicates of the same value.
#So if there isn't a duplicate of one particular value, move on. Otherwise sample one between
#those duplicates.

  l<-dim(SUBDATA)[1]
  if (l==1){DATA.NEW[i,]<-SUBDATA[l,]}else{lc<-sample(1:l,1)}
  DATA.NEW[i,]<-SUBDATA[lc,]
  }
 return(DATA.NEW)
}

test<-sort.song(DATA)

But it doesn't work! :( I get the following error message:

Error in `[<-.factor`(`*tmp*`, iseq, value = integer(0)) : 
replacement has length zero

It may be a silly question, but I'm kind of without options here (total R beginner)

Any help will be highly appreciated!!!!

Do you want to sample the duplicates at random? If not, something like this would work: `DATA[!duplicated(DATA$Point), ]` – waferthin, Apr 15 '14 at 13:08
Yes, I would like to randomly sample duplicates (including the value on which the duplicates are based). I mean, the function duplicated() shows me only the duplicated values. I want to sample between the duplicates AND the value that they duplicate. OK, I may be getting confusing since I'm a total newbie in R. – Mohr, Apr 15 '14 at 13:17

score 0 · Answer 1 · answered Apr 15 '14 at 13:16

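R has the built-in functions sample and duplicated. Thus you can simply use

DATA[ sample( which(!duplicated(DATA$Point)), N ), ]
# where `N` is the sample size you'd like.
# which() converts the logical vector to row indices, so that sample()
# draws row numbers rather than TRUE/FALSE values.

In data.table syntax, the above would be

DATA[ sample( which(!duplicated(Point)), N )]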
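
To illustrate the idea of sampling N rows with distinct Point values, here is a minimal sketch on made-up data. The toy data frame, the value of `N`, and the helper names `first_rows`, `picked` and `result` are assumptions for illustration only. Note that this keeps the first occurrence of each Point, so it does not choose randomly among duplicates.

```r
set.seed(1)  # only to make this illustration reproducible

# Hypothetical toy data with duplicated Point values
DATA <- data.frame(Point = c("a", "b", "b", "c", "c", "c"),
                   value = 1:6)

N <- 2  # assumed sample size; must not exceed the number of distinct Points

# Row indices of the first occurrence of each Point
first_rows <- which(!duplicated(DATA$Point))

# Draw N of those indices without replacement
picked <- first_rows[sample.int(length(first_rows), N)]

result <- DATA[picked, ]
print(result)
```

Indexing `first_rows` with `sample.int()` rather than calling `sample(first_rows, N)` sidesteps the case where `first_rows` has length one and `sample()` would treat it as `1:n`.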

Ricardo Saporta

I wasn't clear enough in my question. Using !duplicated will give only those that are not duplicates, right? I want to sample only the duplicated ones. I included some additional info on the original question. – Mohr Apr 15 '14 at 13:27
you can remove the bang (`!`). If you want all such rows, use `sample(DATA$Point %in% DATA$Point[duplicated(DATA$Point)])` – Ricardo Saporta Apr 15 '14 at 15:40

score 0 · Answer 2 · answered Apr 15 '14 at 13:28

So you want every row that is not duplicated AND the first instance of those that are duplicated, right?

Then try this:

# build fake dataset
DATA <- as.data.frame(cbind(sample(c(1:10,3:7)),sample(1:15),sample(1:15)))
names(DATA) <- c("Point","some_col","some_other_col")

# check
print(DATA) # See Point has duplicate values


# your function
filter_data <- function(DATA){
  distinct_points <- unique(DATA$Point)
  as.data.frame(t(sapply(distinct_points, function(x){subset(DATA,Point == x)[1,]})))
}


#result
DATA.new <- filter_data(DATA)
print(DATA.new)
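
Note that the function above always keeps the first row for each Point. If a random row per Point is wanted instead, one possible variant is sketched below. The function name `sample_one_per_point` and the toy data are hypothetical and not part of the original answer; rows where Point is NA would be dropped by `split()`.

```r
# Hypothetical helper: keep one randomly chosen row per distinct Point
sample_one_per_point <- function(DATA) {
  # Group the row indices by Point value
  idx_by_point <- split(seq_len(nrow(DATA)), DATA$Point)
  # Pick one index from each group; indexing with sample.int() avoids
  # sample()'s behaviour of treating a single number n as 1:n
  picked <- vapply(idx_by_point,
                   function(idx) idx[sample.int(length(idx), 1)],
                   integer(1))
  # Return the chosen rows in their original order
  DATA[sort(unname(picked)), , drop = FALSE]
}

# Toy data: Point values 3 and 5 are duplicated
DATA <- data.frame(Point = c(1, 3, 3, 5, 5, 5, 7),
                   some_col = 1:7)
res <- sample_one_per_point(DATA)
print(res)
```

Unlike the `sapply` version, this returns ordinary data.frame columns rather than list columns, because the rows are selected by index instead of being rebuilt from a matrix.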

moodymudskipper

Could you please explain to me what this string command means? as.data.frame(t(sapply(distinct_points, function(x){subset(DATA,Point == x)[1,]}))) } – Mohr Apr 15 '14 at 13:37
sapply takes all the distinct points one by one and takes the first row of DATA with this value. It returns a matrix that I have to transpose to get it back into the initial format, then convert back to a data.frame. – moodymudskipper Apr 15 '14 at 14:07
Great example. Is this a random selection from the multiplicators @moodymudskipper? – Sander W. van der Laan Oct 11 '22 at 14:31

score 0 · Accepted Answer · answered Apr 15 '14 at 13:37

If you want to choose a random duplicate to keep, rather than duplicated's default behaviour of only keeping the first, then why not randomly shuffle the whole dataset, so that choosing the first in the shuffled set is effectively a random row from the original:

DATAr <- DATA[sample(1:nrow(DATA)),]
DATAr <- DATAr[!duplicated(DATAr$Point),]

If the order of your original DATA was important, store the sample(...) in a variable, use that to re-order your data, and apply an inverse once you've removed duplicates (or add a column DATA$ind <- 1:nrow(DATA) and sort your data to restore this afterwards).
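
The index-column approach for restoring the original order could look roughly like the sketch below. The toy data frame and the `value` column are made up for illustration; `ind` is the helper column suggested above.

```r
set.seed(42)  # only to make this illustration reproducible

# Hypothetical toy data: Point values 3 and 5 are duplicated
DATA <- data.frame(Point = c(1, 3, 3, 5, 5, 5, 7),
                   value = letters[1:7])

# Remember the original row positions before shuffling
DATA$ind <- seq_len(nrow(DATA))

# Shuffle the rows, then keep the first occurrence of each Point,
# which is a random one of the original duplicates
DATAr <- DATA[sample(nrow(DATA)), ]
DATAr <- DATAr[!duplicated(DATAr$Point), ]

# Restore the original ordering and drop the helper column
DATAr <- DATAr[order(DATAr$ind), ]
DATAr$ind <- NULL

print(DATAr)
```

Each distinct Point appears exactly once in the result, and which of its duplicate rows survives depends on the shuffle.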

Gavin Kelly

It does seem to work, but I couldn't understand why it is working, hehehehe. The first command is sampling all rows, without missing any row from the original dataset? I see that the default for sample is replace=FALSE, so there won't be any duplicated row, correct? Then, in the second command you overwrite the object DATAr with only those lines of DATAr that are not duplicated!? Bottom line: with these 2 commands I create a new data.frame with all original rows without duplicates? – Mohr Apr 15 '14 at 14:00
`sample(1:nrow(DATA))` will create a re-arrangement of the row numbers. `replace` here is not really the same concept as `duplicate`. If I have 5 rows, then my version could produce `5,4,3,2,1` whereas the `replace=TRUE` version could produce `1,1,5,5,5`. The former will (by chance) reverse your rows, and so in this instance result in the final duplicate being kept; the latter `replace=TRUE` would create artificial duplicates of rows 1 and 5, a _very_ bad thing, so the default is the correct option. Your summary is correct. – Gavin Kelly Apr 15 '14 at 14:06
Wonderful. Thanks for being attentive! – Mohr Apr 15 '14 at 14:09
Nice solution. Quick tip though: If the data is substantial, sampling all of the data might be burdensome on the compute time. It will be more efficient to just shuffle the indices and sample from there – Ricardo Saporta Apr 15 '14 at 15:42
