I’m using R and I have a vector, let’s say vec <- c(1:10). I need to sample from this vector about 1000 times, but the sample size I need to use is a non-integer, for example 3.66666. When I supply this value, it is rounded down to 3. What I’d like to do is take multiple samples at the sizes of the two integers on either side of the non-integer number (for example 3 and 4). The output would then be a series of samples varying between sizes 3 and 4, with the average sample size across the 1000 samples equal to 3.666666. If these could be stored in a matrix, that would be ideal.

This is further complicated because I have a series of different non-integer values that need to be used as sample sizes, each also sampled 1000 times. These are currently stored in a vector, sample.size <- c(3.6666, 4.25, 5.3……)

Finally, each of the values in the vector has its own weight/probability of being sampled. When taking just one sample, I know you can supply a vector giving the weight/probability for each value in the original vector, but in this more complicated scenario I don’t even know where to begin.

I’m not entirely sure if this entire process can be done, nor do I really know where to start, but any help would be appreciated.

ricks.k
  • There are many, many distributions which would have a mean of 3.666. You need to decide how to choose the number for each sample. This is a modeling decision; not a programming decision. – MrFlick May 01 '15 at 05:28
  • @MrFlick, I don't think the question is about distributions with a mean of 3.666. It's more about the lengths of the samples. If we want the mean of sample sizes to be 3.666 and we're only considering sample sizes of 3 and 4, we could work with, say, 3 replications, one with a sample size of 3 and two with a sample size of 4. At least that's how I read this question... – A5C1D2H2I1M1N2O1R2T1 May 01 '15 at 07:00

1 Answer

One approach to handling "non-integer sample sizes" would be to create a sequence where you increment the value by the sample size each time and round. For instance, with sample size 2.5, you would have:

round(seq(0, by=2.5, length.out=10))
# [1]  0  2  5  8 10 12 15 18 20 22

The gaps in this sequence are 2, 3, 3, 2, 2, and so on, and their average approaches 2.5 as the sequence gets longer. You can extract these gaps with the diff function.
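
As a quick check of the gap idea, here is a minimal sketch; the variable names ss and gaps are only illustrative. Because the gaps always sum to the last rounded value, their mean over 1000 gaps is round(1000 * ss) / 1000.

```r
ss <- 2.5

# Gaps between consecutive rounded multiples of ss
gaps <- diff(round(seq(0, by = ss, length.out = 10)))
gaps
# [1] 2 3 3 2 2 3 3 2 2

# With 1001 points (1000 gaps) the mean gap equals ss here
long.gaps <- diff(round(seq(0, by = ss, length.out = 1001)))
mean(long.gaps)
# [1] 2.5
```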

Now it's pretty straightforward to generate weighted samples from a set s and weights w with sample size ss:

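# Draw 1000 weighted samples from s whose sizes average ss
get.samples <- function(ss, s, w) {
  # Each gap is one of the two integers on either side of ss
  sizes <- diff(round(seq(0, by=ss, length.out=1001)))
  # Sampling is without replacement, so no size may exceed length(s)
  lapply(sizes, function(x) sample(s, x, prob=w))
}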

This returns a list storing the samples:

set.seed(144)
head(get.samples(3.666, 1:10, 1:10))
# [[1]]
# [1] 10  5  6  7
# 
# [[2]]
# [1]  9  6 10
# 
# [[3]]
# [1]  5 10  4  7
# 
# [[4]]
# [1] 10  6  9  8
# 
# [[5]]
# [1] 10  6  7
# 
# [[6]]
# [1]  4  8  9 10
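
The question also asks for matrix storage and for several sample sizes at once. One possible way to extend the function above is sketched here. The helper samples.to.matrix, the NA padding, and the weights w are assumptions made for illustration, not part of the original answer. Only the first three values of sample.size from the question are used.

```r
# get.samples() as defined in the answer
get.samples <- function(ss, s, w) {
  sizes <- diff(round(seq(0, by = ss, length.out = 1001)))
  lapply(sizes, function(x) sample(s, x, prob = w))
}

# Hypothetical helper: one row per sample, shorter samples padded with NA
samples.to.matrix <- function(samples) {
  width <- max(sapply(samples, length))
  do.call(rbind, lapply(samples, function(x) c(x, rep(NA, width - length(x)))))
}

vec <- 1:10
w <- 1:10                            # assumed weights, for illustration only
sample.size <- c(3.6666, 4.25, 5.3)  # first three values from the question

set.seed(144)
# A list with one 1000-row matrix per requested sample size
result <- lapply(sample.size, function(ss) {
  samples.to.matrix(get.samples(ss, vec, w))
})
names(result) <- sample.size

# Realized mean sample size for each requested size
mean.sizes <- sapply(result, function(m) mean(rowSums(!is.na(m))))
```

Padding with NA keeps every sample in its own row while allowing rows of different lengths. A list may be the simpler structure if the padding gets in the way.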
josliber