I want to take one random Site from every Region, create a new data frame, and repeat this process until all Sites are sampled. So no two data frames will contain the same Site from the same Region.

A few Regions in my real data frame have more Sites than the other Regions (in the example below, Region C has 4 Sites). I want to remove those extra rows (perhaps I should do this before making multiple data frames).

Here is an example data frame (real one has >100 Regions and >10 Sites per Region):

mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site 
5 1 A X1
5 6 A X2
8 9 A X3
2 3 B X1
3 1 B X2
7 8 B X3
1 2 C X1
9 4 C X2
4 5 C X3
6 7 C X4')

Repeating the following code three times produces data frames that contain the same Site for a given Region (the second and third tables both have Site X2 for Region A).

do.call(rbind, lapply(split(mydf, mydf$Region), function(x) x[sample(nrow(x), 1), ]))

  V1 V2 Region Site
A  8  9      A   X3
B  2  3      B   X1
C  6  7      C   X4

  V1 V2 Region Site
A  5  6      A   X2
B  7  8      B   X3
C  9  4      C   X2

  V1 V2 Region Site
A  5  6      A   X2
B  3  1      B   X2
C  6  7      C   X4

Could you please help me create multiple data frames so that every data frame contains all Regions, but each Region-Site combination appears in only one data frame?

EDIT: Here is the expected output. To produce it, in the first sampling, draw one Site (row) at random from every Region and make a data frame. In the second sampling, repeat the same process, but a Site that was already drawn for a given Region cannot be drawn again. What I want is independent data frames that contain unique combinations of Region-Site.

V1 V2 Region Site
5 1 A X1
7 8 B X3
1 2 C X1

V1 V2 Region Site
5 6 A X2
3 1 B X2
4 5 C X3

V1 V2 Region Site
8 9 A X3
2 3 B X1
9 4 C X2

2 Answers

The great data.table package actually makes this very easy:

# Turn mydf into a data.table 
library(data.table)
setDT(mydf)

# Shuffle the rows of the table
# (setDT() converted mydf to a data.table in place, so index mydf directly)
dt <- mydf[sample(.N)]

# In case there are multiple rows for a given Region <-> Site pair,
# eliminate duplicates.
dt <- unique(dt, by = c('Region', 'Site'))

# Get the first sample from each region group
# Note: .SD refers to the sub-tables after grouping by Region
dt[, .SD[1], by=Region]

# Get the second and third sample from each region group
dt[, .SD[2], by=Region]
dt[, .SD[3], by=Region]
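
If the tables are needed as separate objects, the steps above can be wrapped in a loop. This is only a minimal sketch: it assumes the number of draws should equal the size of the smallest Region, and the names `n_draws` and `samples` are illustrative, not part of the original answer.

```r
library(data.table)

# Example data from the question
mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site
5 1 A X1
5 6 A X2
8 9 A X3
2 3 B X1
3 1 B X2
7 8 B X3
1 2 C X1
9 4 C X2
4 5 C X3
6 7 C X4')

# Shuffle the rows, then drop duplicate Region-Site pairs
dt <- as.data.table(mydf)[sample(.N)]
dt <- unique(dt, by = c("Region", "Site"))

# Assumption: draw as many times as the smallest Region has Sites
n_draws <- dt[, .N, by = Region][, min(N)]

# One data.table per draw: the k-th shuffled row of every Region
samples <- lapply(seq_len(n_draws), function(k) dt[, .SD[k], by = Region])
```

Because every draw takes a different row position within each Region, no Region-Site pair can appear in more than one element of `samples`. Sites beyond `n_draws` in larger Regions (one of Region C's four here) are simply left out.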

In fact, you could combine this into a one-liner, as Frank suggested:

library(data.table)
dt <- setDT(mydf)
dt <- unique(dt, by = c('Region', 'Site'))
dt[sample(.N), .SD[1:3], by = Region]
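
Note that the one-liner returns a single table with three rows per Region rather than three separate tables. One possible way to split it, sketched here with the hypothetical names `res`, `draw`, and `draws`, is to number the rows within each Region using `rowid()` and then call `split()` with `by`:

```r
library(data.table)

# Example data from the question
mydf <- read.table(header = TRUE, text = 'V1 V2 Region Site
5 1 A X1
5 6 A X2
8 9 A X3
2 3 B X1
3 1 B X2
7 8 B X3
1 2 C X1
9 4 C X2
4 5 C X3
6 7 C X4')

dt <- unique(as.data.table(mydf), by = c("Region", "Site"))

# Shuffle, then keep the first three rows of every Region
res <- dt[sample(.N), .SD[1:3], by = Region]

# Number the rows within each Region (1, 2, 3): this is the draw index
res[, draw := rowid(Region)]

# One data.table per draw index, returned as a named list
draws <- split(res, by = "draw")
```

Each element of `draws` then holds one Site per Region, and `draws[["1"]]`, `draws[["2"]]`, and `draws[["3"]]` correspond to the first, second, and third samples.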
andrew
  • The "eliminate duplicates" step can be done with `unique`, too. Re the later part, maybe just `dt[, .SD[1:3], by=Region]` since the OP just wants three sites per region. As a side note, I think it's bad practice to overwrite `dt <-` since it's harder to debug. – Frank Mar 10 '17 at 20:35
  • Emphasizing clarity for OP. Added a one-liner too. – andrew Mar 10 '17 at 20:46
  • Ok, nice one-liner. Btw, just to clarify. `dt <- dt[...]` makes the code less clear since if I want to figure out what `dt` looked like before such a step, I need to run the code again from the top. Simpler would be `newdt <- dt[...]`. A separate point: with `setDT`, there is no need to assign. `mydf` has become a data.table itself, modified by reference. – Frank Mar 10 '17 at 20:52

It worked! I don't see a check mark for accepting the answer, so I am doing it here.