1

I am trying to draw from two different distributions with a probability 100000 times. Unfortunately I can't see what is wrong with my for loop, however, it only adds 1 value to simulated_data instead of the desired 100,000 values.

Question 1: How can I fix this?

Question 2: Is there a far more efficient method where I don't have to loop through 100,000 items in a list?

#creating a vector of probabilities
probabilities <- rep(0.99,100000)
#creating a vector of booleans
logicals <- runif(length(probabilities)) < probabilities

#empty list for my simulated data
simulated_data <- c()

#drawing from two different distributions depending on the value in logicals
for(i in logicals){

  if (isTRUE(i)) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  }else{
     simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
   }
}
IRTFM
  • 258,963
  • 21
  • 364
  • 487
Ollie
  • 624
  • 8
  • 27
  • I see you solved your problem neatly, but for future reference the reason your loop failed is because when you use a `for each` loop, the `i`th term is not an index. Thus, `simulated_data[i]` is only ever instantiated as `simulated_data[TRUE]` or `simulated_data[FALSE]`, and is only ever assigned one value. One potential resolution is to use `for (i in 1:length(logicals))`. – Zach Oct 20 '16 at 04:54
  • @Zach: almost a +1, except that you shouldn't use `1:n` in loops. Use `seq_len(n)` instead (the former would do weird things if `n` turned out to be zero, or negative.). `seq_along` is another relevant function. – JDL Oct 20 '16 at 09:26

3 Answers3

1

It seems that you want to create a final sample where each element is taken randomly from either sample1 or sample2, with probabilities 0.99 and 0.01.

The correct approach would be to generate both samples, each containing the same number of elements and then select randomly from either one.

The correct approach would be:

# Generate both samples
n = 100000
sample1 = rnorm(n,0,1)
sample2 = rnorm(n,0,10)

# Create the logical vector that will decide whether to take from sample 1 or 2
s1_s2 = runif(n) < 0.99

# Create the final sample
sample = ifelse(s1_s2 , sample1, sample2)

In this case, it is not guaranteed that there are exactly 0.99*n samples from sample1 and 0.01*n from sample2. In fact:

> sum(sample == sample1)
[1] 98953

This is close to 0.99*n, as expected, but not exactly.

R. Schifini
  • 9,085
  • 2
  • 26
  • 32
1

Create a vector with the desired fraction of values from each distribution and then create a random permutation of the values:

N = 10000
frac =0.99
rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )

> table( abs(rand_mix) >1.96)

FALSE  TRUE 
 9364   636 
> (100000-636)/100000
[1] 0.99364

> table( rnorm(10000) >6)

FALSE 
10000 

The fraction is fixed. If you wante a possibly random fraction (but close to 0.99 statistically) then try this:

> table( sample( c( rnorm(10e6), rnorm(10e4, sd=10) ), 10e4) > 1.96 )

FALSE  TRUE 
97151  2849 

Compare with:

> N = 100000
> frac =0.99
> rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )
> table( rand_mix > 1.96 )

FALSE  TRUE 
97117  2883 
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I think this the best so far. Not forced like forcing the 0.99 and 0.1 exactly on the distributions. – mat4nier Oct 20 '16 at 06:02
  • Actually it is forcing there to be exactly 0.99*N from one and 0.01*N from the other, but the values are random so the fraction above a specified cut-point will vary from draw to draw. This solution may be "better" in teh sense that it returns the values in a manner that does not really allow you to know which distribution any particular value came from since the ordering is scrambled. – IRTFM Oct 20 '16 at 06:08
  • Ah so it's the same as the first solution of just generating a vector 0.99*n of 1 dist, and 0.01*n of another dist and glueing them together. – mat4nier Oct 20 '16 at 06:10
  • Modulo the randomly permuted ordering. – IRTFM Oct 20 '16 at 06:11
  • Ahh yes. Similar distribution, but order inside the vector much different. – mat4nier Oct 20 '16 at 06:11
  • Yes. The `sample` function with a single vector argument returns a random permutation. – IRTFM Oct 20 '16 at 06:13
0

Here is a nice solution for anyone here:

n <- 100000
prob1 <- 0.99
prob2 <- 1-prob1 

dist1 <- rnorm(prob1*n, 0, 1)
dist2 <- rnorm(prob2*n, 0, 10)

actual_sample <- c(dist1, dist2)
Ollie
  • 624
  • 8
  • 27