
I have a df of measurements over 50 years. I am trying to subsample the data to see what patterns I would have found had I only sampled in 2 years, or in 3, 4, 5, etc., instead of in all 50. I wrote code that pulls random years from the dataset, but I want to add the condition that the sampled years are reasonably spread out across the dataset (at least 10 years apart, or something similar).

Is there a way to do this kind of conditional random sampling?

Here's what I'm doing so far. It would be easiest to stay in this format because I pipe (%>%) on to other steps from here.

# load packages for %>%, sample_n() and map_dfr()
library(dplyr)
library(purrr)

# build df
df <- data.frame(year = 1:50,
                 response = runif(50, 1, 100))

# set number of times I'll do the simulation
number_simulations <- 5 

# set number of years I'll sample in each simulation
# (I later put this in a for loop so that I could repeat 
#  this process with more and more sample years)
number_samples <- 2



df %>% 
  
  # repeat df x number of times
  replicate(number_simulations, ., simplify = FALSE) %>%  
  
  # pick n random samples from df
  map_dfr(~ sample_n(., number_samples), .id = "simulation")

# Can I change this code to make sure sampled years aren't too close to each other? 
# years 23 and 25 out of 50 won't tell me much. But 23 and 35 would be fine. 

I'm thinking the easiest approach would be to write a function, sample_n_conditional(), that I could drop in place of sample_n in the map_dfr line. It would have to say something like "sample n years that are at least 10 years apart." Even better would be something more dynamic that depends on the number of samples, since 10 years apart becomes impossible to satisfy as I pull more years. So more like "sample n years that are reasonably proportionally spread out in the series."

I considered setting the total number of simulations far higher than I need and then filtering out the ones whose years are too close together, assuming that enough would meet my criteria by chance. But that's not ideal.

Any ideas appreciated.

Jake L

1 Answer


You could use a repeat loop that keeps drawing samples and only breaks once every gap between the sampled years is above a threshold.

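n.sim <- 5  ## number of simulations
n.samp <- 2  ## number of samples (also works for n.samp > 2)
thres <- 10  ## threshold: minimum gap between sampled years

set.seed(42)
res <- replicate(n.sim, {
  repeat {
    ## draw n.samp random rows
    samp <- df[sample(nrow(df), n.samp), ]
    ## sort the years first so that diff() compares neighbouring years;
    ## without the sort, n.samp > 2 could let close years slip through
    if (all(diff(sort(samp[["year"]])) > thres)) break
  }
  samp
}, simplify=FALSE)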

Result

res
# [[1]]
# year  response
# 49   49 97.125694
# 37   37  1.726081
# 
# [[2]]
# year  response
# 1     1 91.565798
# 25   25  9.161318
# 
# [[3]]
# year response
# 10   10 70.80141
# 36   36 83.45869
# 
# [[4]]
# year response
# 18   18 12.63125
# 49   49 97.12569
# 
# [[5]]
# year response
# 47   47 88.88774
# 24   24 94.72016
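
To stay in the pipeline format from the question, the same repeat-until-valid idea can be wrapped in a function. This is only a sketch: the name sample_n_conditional() comes from the question, while the arguments min_gap and max_tries and the default gap rule (series length divided by 2n) are assumptions for illustration, not an established rule.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: redraw n rows until all sampled years are more than
# min_gap apart. The default gap shrinks as n grows (an assumed rule).
sample_n_conditional <- function(data, n,
                                 min_gap = floor(diff(range(data$year)) / (2 * n)),
                                 max_tries = 10000) {
  for (i in seq_len(max_tries)) {
    samp <- data[sample(nrow(data), n), ]
    # sort so that diff() compares neighbouring years
    if (n < 2 || all(diff(sort(samp$year)) > min_gap)) return(samp)
  }
  stop("no valid sample found; try a smaller min_gap")
}

set.seed(42)
df <- data.frame(year = 1:50, response = runif(50, 1, 100))

res <- df %>%
  replicate(5, ., simplify = FALSE) %>%
  map_dfr(~ sample_n_conditional(.x, 2), .id = "simulation")
```

Rejection sampling like this gets slow when n is large relative to the series, because few random draws satisfy the gap; max_tries makes it fail with an error instead of looping forever.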

Data:

set.seed(42)
df <- data.frame(year=1:50, response=runif(50, 1, 100))
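
For the "proportionally spread out" variant, a different technique may fit better than a fixed threshold: stratified sampling, where the series is cut into n contiguous blocks and one year is drawn from each. The sketch below is an assumption about what "spread out" should mean, and sample_n_stratified() is a hypothetical name. Note that it does not guarantee a minimum gap, since years from neighbouring blocks can still be adjacent (for example 25 and 26).

```r
# Hypothetical helper: split the sorted years into n contiguous blocks of
# (nearly) equal size and draw one row at random from each block.
sample_n_stratified <- function(data, n) {
  data <- data[order(data$year), ]
  pos <- seq_len(nrow(data))
  # block index 1..n for every row; requires n <= nrow(data)
  block <- ceiling(pos * n / nrow(data))
  rows <- vapply(split(pos, block),
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))
  data[rows, ]
}

set.seed(1)
df <- data.frame(year = 1:50, response = runif(50, 1, 100))
strat <- sample_n_stratified(df, 5)
```

Because each block always yields exactly one year, this never needs to redraw, so it stays fast as n grows.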
jay.sf