R: select a subset based on probability

Question

I'm new to R. I have a sample drawn from a normal distribution:

n <- rnorm(1000, mean=10, sd=2)

As an exercise I'd like to create a subset based on a probability curve derived from the values. E.g., for values < 5, I'd like to keep a random 25% of the entries; for values > 15, a random 75% of the entries; and for values between 5 and 15, I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems like what I want is the "sample" command and its "prob" option, but I'm not clear on the syntax.

Currently there is an issue with the formulation. Once the sample is realized, we know exactly what 25% and 75% are for <5 and >15 subsamples. But 25% and 75% also aren't really probabilities; they are just proportions. Now when talking about (5,15) and interpolation, it sounds like each point should be picked with a certain (interpolated) probability, which will ultimately lead to a random number of points selected; that's a different mechanism than with <5 and >15 cases. Is that what you are after? — Julius Vainora, Feb 21 '19 at 00:33
Yes, correct: for the (5,15) range, the sample size is not known upfront. — MrSparkly, Feb 21 '19 at 00:35

score 1 · Accepted Answer · answered Feb 21 '19 at 00:55

For the first two subsets we may use

# Values below 5: keep a random 25% of them
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), sum(idx1) * 0.25)]
# Values above 15: keep a random 75% of them
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), sum(idx2) * 0.75)]

while for the third one,

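# Remaining values lie in [5, 15]
idx3 <- !idx1 & !idx2
# Selection probability rises linearly from 0.25 at 5 to 0.75 at 15
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
# Draw TRUE (take) or FALSE (don't take) independently for each element
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]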

where probs are the linearly interpolated probabilities for each element of n[idx3]. Then, using sapply, we draw TRUE (take) or FALSE (don't take) for each of those elements, so the size of ss3 is itself random.
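The per-element draw can also be written without sapply. The following is a minimal sketch of the same idea, not the code from the answer: it compares one uniform random number per element against that element's interpolated probability. The names mid, keep and ss3_vec are made up for this illustration, and set.seed is only there to make the sketch reproducible.

```r
set.seed(42)  # only for reproducibility of this sketch
n <- rnorm(1000, mean = 10, sd = 2)

# Middle range: values between 5 and 15 inclusive
mid <- n[n >= 5 & n <= 15]

# Linear interpolation: 0.25 at 5, rising to 0.75 at 15
probs <- 0.25 + (mid - 5) / 10 * (0.75 - 0.25)

# One uniform draw per element; keep the element when the draw
# falls below its probability (an independent Bernoulli trial each)
keep <- runif(length(mid)) < probs
ss3_vec <- mid[keep]
```

Because every element is decided independently, length(ss3_vec) changes from run to run, which matches the clarification in the comments that the sample size for the (5,15) range is not known upfront.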

score 0 · Answer 2 · answered Feb 21 '19 at 00:37

The prob option in sample() gives a vector of probability weights for the elements of the vector being sampled.

https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample
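To see what prob does, here is a small sketch with made-up data (the vector x and the weights are assumptions for illustration only). The weights set how likely each element is to be drawn; the number of draws is still fixed by size.

```r
set.seed(1)  # only for reproducibility of this sketch
x <- c("a", "b", "c")

# Weights need not sum to one; sample() treats them as relative weights
w <- c(5, 3, 2)

# Draw 10000 times with replacement, using the weights above
draws <- sample(x, size = 10000, replace = TRUE, prob = w)

# Observed relative frequencies should be close to 0.5, 0.3, 0.2
freq <- table(draws) / length(draws)
```

Note that this is a different mechanism from deciding independently, element by element, whether to keep a value: here the total number of draws is always size.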

So if I understood the question right, what you want is to sample only 25% of the values < 5, 75% of the values > 15, and so on.

Then you have to use the size parameter (n is the number of items to choose from; size is the number of items to choose).

As the documentation says

size a non-negative integer giving the number of items to choose.

There you could input the percentage of the sample you want multiplied by the length of the vector you are sampling from.
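A minimal sketch of that idea follows. In sample() and sample.int() the number of items to draw is passed as size. keep_fraction is a hypothetical helper written for this illustration, not a base R function; rounding the count down with floor is also an assumption about how a fractional count should be handled.

```r
# keep_fraction is a hypothetical helper, not part of base R
keep_fraction <- function(x, frac) {
  k <- floor(length(x) * frac)        # number of items to keep, rounded down
  x[sample.int(length(x), size = k)]  # sample positions, then index into x
}

set.seed(1)  # only for reproducibility of this sketch
n <- rnorm(1000, mean = 10, sd = 2)

low  <- keep_fraction(n[n < 5], 0.25)   # random 25% of the values below 5
high <- keep_fraction(n[n > 15], 0.75)  # random 75% of the values above 15
```

Sampling positions with sample.int rather than calling sample(x, k) directly avoids a known surprise: when x is a single number of at least 1, sample(x, ...) samples from 1:x instead of from x itself.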

For your last sample (the values between 5 and 15), you could use runif() to draw uniform random numbers and compare them against selection probabilities that run from .25 to .75.

Hope this helps!
