
I'm new to R. I have a sample drawn from a normal distribution:

n <- rnorm(1000, mean=10, sd=2)

As an exercise, I'd like to create a subset based on a probability curve derived from the values. E.g., for values < 5, I'd like to keep a random 25% of the entries; for values > 15, a random 75% of the entries; and for values between 5 and 15, I'd like to linearly interpolate the probability of selection between 25% and 75%. It seems that what I want is the sample() function and its prob argument, but I'm not clear on the syntax.

MrSparkly
  • Currently there is an issue with the formulation. Once the sample is realized, we know exactly what 25% and 75% are for <5 and >15 subsamples. But 25% and 75% also aren't really probabilities; they are just proportions. Now when talking about (5,15) and interpolation, it sounds like each point should be picked with a certain (interpolated) probability, which will ultimately lead to a random number of points selected; that's a different mechanism than with <5 and >15 cases. Is that what you are after? – Julius Vainora Feb 21 '19 at 00:33
  • Yes, correct: for the (5,15) range, the sample size is not known upfront. – MrSparkly Feb 21 '19 at 00:35

2 Answers


For the first two subsets we may use

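# Keep a random 25% of the values below 5 (count rounded down)
idx1 <- n < 5
ss1 <- n[idx1][sample(sum(idx1), floor(sum(idx1) * 0.25))]
# Keep a random 75% of the values above 15 (count rounded down)
idx2 <- n > 15
ss2 <- n[idx2][sample(sum(idx2), floor(sum(idx2) * 0.75))]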

while for the third one,

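# Everything in the range [5, 15]
idx3 <- !idx1 & !idx2
# Probability rises linearly from 0.25 at 5 to 0.75 at 15
probs <- (n[idx3] - 5) / 10 * (0.75 - 0.25) + 0.25
# Draw TRUE/FALSE once per element, using that element's probability
ss3 <- n[idx3][sapply(probs, function(p) sample(c(TRUE, FALSE), 1, prob = c(p, 1 - p)))]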

where probs are the linearly interpolated probabilities for each element of n[idx3]. Then, using sapply, we draw TRUE (take) or FALSE (don't take) for each of those elements.
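The per-element draw can also be done without sapply. The following is a minimal sketch, not the answer's own code: it assumes the same closed range [5, 15] and uses rbinom() to make one Bernoulli draw per element. The seed and the names mid and ss3_vec are made up for illustration.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible
n <- rnorm(1000, mean = 10, sd = 2)

# Values in the closed range [5, 15]
mid <- n[n >= 5 & n <= 15]

# Linear interpolation: 0.25 at 5, rising to 0.75 at 15
probs <- 0.25 + (mid - 5) / 10 * (0.75 - 0.25)

# One Bernoulli draw per element; TRUE means "keep"
keep <- rbinom(length(mid), size = 1, prob = probs) == 1
ss3_vec <- mid[keep]
```

Because rbinom() is vectorized over prob, this also returns an empty result cleanly when no values fall in the range.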

Julius Vainora

The prob argument of sample() assigns probability weights to the elements of the vector being sampled.

https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/sample

So if I understood the question right, what you want is to sample only 25% of the values < 5, 75% of the values > 15, and so on.

Then you have to use the size parameter.

As the documentation says:

size a non-negative integer giving the number of items to choose.

There you could pass the fraction you want multiplied by the length of the vector being sampled, rounded to a whole number.
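As a rough illustration of that idea, the sketch below keeps a fixed fraction of a vector by passing a count as the size argument. keep_fraction is a hypothetical helper, not part of base R, and the seed is arbitrary. It samples positions with sample.int() rather than values, which sidesteps sample()'s special handling of a length-one numeric x.

```r
# Hypothetical helper: keep a fixed fraction of x, chosen at random
keep_fraction <- function(x, frac) {
  k <- floor(frac * length(x))        # number of items to keep, rounded down
  x[sample.int(length(x), size = k)]  # sample positions, not values
}

set.seed(1)  # assumed seed, only for reproducibility
n <- rnorm(1000, mean = 10, sd = 2)

low_kept  <- keep_fraction(n[n < 5], 0.25)   # 25% of the values below 5
high_kept <- keep_fraction(n[n > 15], 0.75)  # 75% of the values above 15
```

With mean 10 and sd 2, only a handful of values fall outside (5, 15), so both results will be short.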

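For the middle range, you could draw a uniform random number with runif() for each value and keep the value when the draw falls below its selection probability, which runs from .25 to .75.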
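One possible reading of the runif() suggestion, sketched under an assumption that goes beyond the question: here every element, including those below 5 and above 15, is kept independently with its own probability, so the tails yield a random count rather than an exact 25% or 75%. The names p and subset_n and the seed are illustrative.

```r
set.seed(7)  # assumed seed, only for reproducibility
n <- rnorm(1000, mean = 10, sd = 2)

# Clamp values to [5, 15], then interpolate: 0.25 at or below 5, 0.75 at or above 15
p <- 0.25 + (pmin(pmax(n, 5), 15) - 5) / 10 * 0.5

# Keep an element when its uniform draw falls below its probability
subset_n <- n[runif(length(n)) < p]
```

If exact proportions in the tails matter, this mechanism should be applied to the middle range only.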

Hope this helps!