rollapply based on values

Question

I would like to resample a large data set with an unequal number of observations across the range in the data so that each range has an equal number of observations.

It seems like rollapply would be the way to do this, but it doesn't appear that it can be convinced to define its rolling window based on the data values?

For example:

set.seed(12345)    
z <- sort(rnorm(100,100,40))
rollapply(z, 20, function(x){sample(x,20,replace=TRUE)}, by=20)

This does a great job of taking a list of numbers and resampling it every 20 numbers, however, I would like it to start at the lowest value and resample within a regular bin of values. For the above example the (left edge) bins could be defined like:

(0:10)*(max(z)-min(z))/10+min(z)

I know I could write a for loop and do this, but I am looking for a faster / simpler method.

An input vector with unequal distribution of observations between the ranges 1:10 and 11:20: c( 1, 2, 2, 3, 3, 3, 5, 6, 7, 11, 13, 13, 20) Resampled 5 times at 2 intervals of 10 units (i.e from 1:10 and 11:20)each interval sampled 5 times could produce:

c( 3, 1, 7, 3, 2, 11,20,11,13,20)

Can you show expected output based on a minimal example? – markus Nov 29 '18 at 19:35 — markus, Nov 29 '18 at 19:35
expected input / output example added – TBP Nov 29 '18 at 19:47 — TBP, Nov 29 '18 at 19:47

score 1 · Accepted Answer · edited Sep 18 '22 at 12:53

I guess the for loop is the simplest way to do it. The solution I ultimately developed is for a data frame, but is is essentially the same solution you would use for a simple vector (as in the wording of my original question).

Fake data with an uneven distribution:

test <- data.frame(Length=rlnorm(1000,2,1), Weight=rlnorm(1000,3,2))

Define the resampling function:

resamplr <- function(data) {
  bins <- (0:9) * (max(data$Length) - min(data$Length)) / 10 + min(data$Length)    #define a vector representing the left edge of bins.
  step <- (max(data$Length) - min(data$Length)) / 10 + .000001    #define the step and add a little so you don't land on any number exactly (i.e right edge)
  result <- NULL
  for (i in 1:length(bins)) {
    temp <- data[data$Length >= bins[i] & data$Length < (bins[i] + step), ]   #select data range
    result <- rbind(result, temp[sample(nrow(temp), 10, replace = T),])  #randomly sample it with replacement, and tack it onto the resampling from the previous range.
  }
  return(result)
}

Then to use the function:

resamplr(test)

Refinements and suggestions improving the approach are of course appreciated...

rollapply based on values

1 Answers1