Subset data by randomly sampling one Site per Region but keeping rows of different Year

Question

This is an extension of an earlier question, "Randomly sample per group, make a new dataframe, repeat until all entities within a group are sampled".

From the example data below, I want to produce multiple data frames by randomly sampling one Site from every Region. Each further data frame takes another random sample of Site without replacement; that is, a Site of a given Region that was sampled in any previous draw cannot be sampled again. So there will be as many data frames as there are sites in the largest region. This part of my question was answered in the link above (although I could not find a check mark to accept that answer on that website).

My question here concerns another data frame that has data from multiple years for a given site. I want each data frame to contain unique Region-Site combinations (answered in the link above) while keeping the rows from all years. Here is some example data (the number of years and sites differs between regions):

mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')

Here are some expected data frames:

V1 V2 Region Site Year
5  1      A   X1 2000
1  1      A   X1 2001
3  1      B   X2 2000
4  4      B   X2 2001
1  2      C   X1 2000
9  4      C   X1 2001

V1 V2 Region Site Year
8  9      A   X3 2000
5  5      A   X3 2001
3  3      B   X1 2000
2  3      B   X1 2001
4  5      C   X2 2000
6  7      C   X2 2001

I tried to modify the code provided in the link above, but it did not work. Here is the code I tried:

library(data.table)
dt <- setDT(mydf)
dt <- dt[sample(.N)]
dt <- unique(dt, by = c('Year','Region'))
dt[, .SD[1], by=c("Region","Year")]

akrun · Accepted Answer · 2017-03-11T17:34:45.303

Because no 'Year' is duplicated within a 'Region'/'Site' combination, we can sample sites and keep all of their rows. After converting to a 'data.table' (setDT(mydf)) and grouping by 'Region', we sample one of the unique elements of 'Site', get the row indices (.I) where 'Site' equals the sampled element, extract those indices ($V1), and use them to subset the rows of the dataset:

setDT(mydf)[mydf[,  .I[Site ==sample(unique(Site), 1)], .(Region)]$V1]
#   V1 V2 Region Site Year
#1:  5  1      A   X1 2000
#2:  1  1      A   X1 2001
#3:  3  1      B   X2 2000
#4:  4  4      B   X2 2001
#5:  1  2      C   X1 2000
#6:  9  4      C   X1 2001
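
A quick way to check a draw like the one above is to confirm that it holds exactly one 'Site' per 'Region' and every 'Year' row of that site. This is only a sketch: the seed and the names idx, one_draw, sites_per_region, chosen and all_rows are illustrative and not part of the original answer.

```r
library(data.table)

mydf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')
setDT(mydf)

set.seed(1)  # assumption: fixed seed only so the draw is repeatable

# row indices of one randomly chosen Site within each Region
idx <- mydf[, .I[Site == sample(unique(Site), 1)], by = Region]$V1
one_draw <- mydf[idx]

# number of distinct sites per Region in the draw (should be 1 everywhere)
sites_per_region <- one_draw[, .(n_sites = uniqueN(Site)), by = Region]

# all rows of the chosen Region/Site pairs in the full data, for comparison
chosen <- unique(one_draw[, .(Region, Site)])
all_rows <- mydf[chosen, on = c("Region", "Site")]
```

If nrow(one_draw) equals nrow(all_rows), no 'Year' row of a chosen site was lost.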

To repeat the draw several times, we can use replicate (each draw is independent, so the same 'Site' can appear in more than one list element):

setDT(mydf)
lst <- replicate(5, mydf[mydf[,  .I[Site ==sample(unique(Site), 1)],
                .(Region)]$V1], simplify = FALSE)
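
One possible way to inspect such a list is to stack its elements with an id column and summarise which 'Site' each draw picked. The sketch below assumes a fixed seed for repeatability; the names combined, picked and the column draw are illustrative, not part of the answer.

```r
library(data.table)

mydf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')
setDT(mydf)

set.seed(123)  # assumption: fixed seed only so the draws are repeatable

# five independent draws, each with one random Site per Region
lst <- replicate(5,
                 mydf[mydf[, .I[Site == sample(unique(Site), 1)], by = Region]$V1],
                 simplify = FALSE)

# stack the list; 'draw' records which list element each row came from
combined <- rbindlist(lst, idcol = "draw")

# one row per draw and Region, showing the Site that was picked
picked <- unique(combined[, .(draw, Region, Site)])
```

Because the draws are independent, picked will usually show the same 'Region'/'Site' pair in several draws.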

Update

If we need to exclude any 'Site' that has already been drawn, we can use a for loop. Each iteration stores the rows of the sampled 'Site' per 'Region' in a list of data.tables ('lst1') and then updates the dataset so that it keeps only the rows that have not yet been sampled:

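setDT(mydf)
mydf1 <- copy(mydf)        # keep an untouched copy; the loop overwrites mydf
lst1 <- vector("list", 3)  # 3 = largest number of sites in any Region
for (i in 1:3) {
  # draw one of the remaining sites in each Region
  tmp <- mydf[, .(Site = sample(unique(Site), 1)), Region]
  # keep every row (all years) of the drawn Region/Site pairs
  lst1[[i]] <- mydf[tmp, on = .(Region, Site)]
  # drop the drawn sites so they cannot be drawn again
  mydf <- mydf[mydf[tmp, Site != i.Site, on = .(Region)]]
}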

lst1
#[[1]]
#   V1 V2 Region Site Year
#1:  5  6      A   X2 2000
#2:  2  2      A   X2 2001
#3:  3  3      B   X1 2000
#4:  2  3      B   X1 2001
#5:  4  5      C   X2 2000
#6:  6  7      C   X2 2001

#[[2]]
#   V1 V2 Region Site Year
#1:  5  1      A   X1 2000
#2:  1  1      A   X1 2001
#3:  7  8      B   X3 2000
#4:  1  2      C   X1 2000
#5:  9  4      C   X1 2001

#[[3]]
#   V1 V2 Region Site Year
#1:  8  9      A   X3 2000
#2:  5  5      A   X3 2001
#3:  3  1      B   X2 2000
#4:  4  4      B   X2 2001
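
As an alternative to the loop (a sketch, not the method used in the answer): sampling without replacement amounts to shuffling the sites within each 'Region' once and letting the k-th data frame take the k-th site of each shuffle. The names sites, tagged, lst2 and the column draw are illustrative, and the seed is only there for repeatability.

```r
library(data.table)

mydf <- read.table(header = TRUE, stringsAsFactors = FALSE, text = 'V1 V2 Region Site Year
  5 1 A X1 2000
  1 1 A X1 2001
  5 6 A X2 2000
  2 2 A X2 2001
  8 9 A X3 2000
  5 5 A X3 2001
  3 3 B X1 2000
  2 3 B X1 2001
  3 1 B X2 2000
  4 4 B X2 2001
  7 8 B X3 2000
  1 2 C X1 2000
  9 4 C X1 2001
  4 5 C X2 2000
  6 7 C X2 2001')
setDT(mydf)

set.seed(42)  # assumption: fixed seed only so the shuffle is repeatable

# one row per Region/Site pair
sites <- unique(mydf[, .(Region, Site)])

# random permutation within each Region: 'draw' is the data frame
# in which that site will appear
sites[, draw := sample(.N), by = Region]

# attach the draw number to every row (all years) of each site
tagged <- mydf[sites, on = c("Region", "Site")]

# one data.table per draw, ordered by draw number
lst2 <- split(tagged, by = "draw", sorted = TRUE, keep.by = FALSE)
```

Each 'Region'/'Site' pair lands in exactly one element of lst2, and regions with fewer sites simply drop out of the later elements, as in the loop output above.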

How can I repeat this process without re-using the same sites for a given region? I want to make multiple independent data frames. — kiyoshi sasaki, Mar 11 '17 at 17:00
@kiyoshisasaki Updated the post. It is better to have a `list` of data.frames than individual data.frame objects. — akrun, Mar 11 '17 at 17:02
Replicate() produced data frames that contain the same sites. Is there a solution so that a given site of a given region appears in only one data frame? — kiyoshi sasaki, Mar 11 '17 at 17:11
@kiyoshisasaki In that case, the `sample` part becomes irrelevant after one or two draws. — akrun, Mar 11 '17 at 17:17
Could you figure out why I am getting this error? Error in `[.data.table`(mydf, tmp, on = .(Region, Site)) : could not find function "." — kiyoshi sasaki, Mar 11 '17 at 17:24
@kiyoshisasaki I am using `data.table_1.10.4`. If you are using an old version, it could result in this error. In that case, remove the package and install a new version. — akrun, Mar 11 '17 at 17:25
