
As I mentioned in a previous question, I am brand new to programming and have no prior experience, but I am very happy to be learning. However, I've run into the following problem: my professor has given us the following:

sim1 <- function(n) {
  xm <- matrix(nrow=n,ncol=2)
  for (i in 1:n) {
    d <- rnorm(1)
    if (runif(1) < 0.5) {
      xm[i,1] <- 1
      xm[i,2] <- 2.5*d + 69
    } else {
      xm[i,1] <- 0
      xm[i,2] <- 2*d + 64
    }
  }
  return(xm)
}

With the following task: Try to improve the efficiency of this code. Use speed.test to see if it is improved for generating n=1000 observations.

I have finally been able to figure out what this code does; nonetheless, I am completely lost on how I could possibly make it more efficient.

Any help means a whole lot. Thank you!

A5C1D2H2I1M1N2O1R2T1
    What have you tried till now to improve efficiency? – MKR Dec 12 '17 at 22:57
  • generally the approach would be to take advantage of r's vectorized functions. for example, `set.seed(1); rnorm(2)` and `set.seed(1); rnorm(1); rnorm(1)` are identical, the first only needs to call `rnorm` once which saves you computation time. Are you supposed to reproduce the results of `sim1` exactly? I don't think you can given how this function is written, eg, `set.seed(1); rnorm(2); runif(2)` and `set.seed(1); rnorm(1); runif(1); rnorm(1); runif(1)` will only match on the first random number – rawr Dec 12 '17 at 23:00
  • I really like this question. What you may not realize, being new to programming, is that it hits on a common problem that developers have in Java, R, and all programming. Doing things *one at a time* instead of all at once or a chunk at a time is very inefficient. – DarkerIvy Dec 13 '17 at 17:35

3 Answers


I'll do what I think is the most obvious step, namely to move rnorm() out of the loop and take advantage of its vectorized nature (as rawr alluded to).

sim2 <- function(n) {
    xm <- matrix(nrow=n, ncol=2)
    d <- rnorm(n)
    for (i in 1:n) {
        if (runif(1) < 0.5) {
            xm[i,1] <- 1
            xm[i,2] <- 2.5*d[i] + 69
        } else {
            xm[i,1] <- 0
            xm[i,2] <- 2*d[i] + 64
        }
    }
    return(xm)
}

n <- 1e3
set.seed(1); system.time(s1 <- sim1(n)); system.time(s2 <- sim2(n))
#  user  system elapsed 
# 0.019   0.004   0.023 
#  user  system elapsed 
# 0.010   0.000   0.009 

t.test(s1[,2], s2[,2]) # Not identical, but similar, again alluded to by rawr

Just that gives us a reasonable improvement. A similar thing can be done with runif() as well, but I'll leave that to you.

If you want some reading material I can recommend Hadley Wickham's Advanced R and the chapter Optimising code.

And in case you're wondering, it is indeed possible to eliminate both the loop and the conditionals.
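As a sketch of what that might look like (this is an illustration only, not necessarily the solution intended here; the name sim1_vec is made up): draw all the coin flips and all the deviates in single calls, then pick the right formula per row with the vectorized ifelse().

```r
# A possible fully vectorized version: no loop, no per-element if/else.
sim1_vec <- function(n) {
    g <- runif(n) < 0.5                         # one call: n coin flips
    d <- rnorm(n)                               # one call: n deviates
    cbind(as.numeric(g),                        # column 1: 1s and 0s
          ifelse(g, 2.5 * d + 69, 2 * d + 64))  # column 2: formula chosen per row
}
```

Like sim2(), this won't reproduce sim1()'s exact stream of random numbers, only the same distribution.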

AkselA

If possible, don't use loops in R. rep and rnorm will fill vectors with 5, 10, or 500,000 values all in one call, very quickly. Calling rnorm(1) 500,000 times is a waste and much slower than simply calling rnorm(500000). It's like taking a Ferrari for a drive, going 1 foot and stopping, going 1 foot and stopping, over and over to get to your destination.
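To make the one-call-vs-many-calls point concrete, here is a rough timing sketch (an illustration only; the absolute numbers will vary by machine):

```r
# Comparing one vectorized call against n scalar calls.
n <- 5e5
t_vec  <- system.time(rnorm(n))                 # one call producing n values
t_loop <- system.time(for (i in 1:n) rnorm(1))  # n calls producing 1 value each
t_vec["elapsed"]   # typically a few milliseconds
t_loop["elapsed"]  # typically far slower
```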

This function will return results statistically identical to those of your function. However, instead of using loops, it does things in the R way.

sim2 <- function(n) {
    n1 <- floor(n/2)  #this is how many of the else clause we'll do
    n2 <- n - n1  #this is how many of the if clause we'll do
    col11 <- rep(0, n1) #bam! we have a vector filled with 0s
    col12 <- (rnorm(n1) * 2) + 64 #bam! vector filled with deviates
    col21 <- rep(1, n2) #bam! vector filled with 1s
    col22 <- (rnorm(n2) * 2.5) + 69 #bam! vector filled with deviates
    xm <- cbind(c(col11,col21), c(col12,col22)) #now we have a matrix, 2 cols, n rows
    return(xm[sample(nrow(xm)),]) #shuffle the rows, return matrix
}

No loops! The functionality might be obvious but in case it is not, I'll explain. First, n1 & n2 are simply to split the size of n appropriately (accounting for odd numbers).

Next, the binomial process (i.e., if(runif(1) < 0.5) {} else {}) per element can be eliminated since we know that in sim1, half of the matrix falls into the if condition and half in the else (see proof below). We don't need to decide for each element over and over and over which random path to take when we know that it's 50/50. So, we're going to do ALL the else 50% first: we fill a vector with n/2 0s (col11) and another with n/2 random deviates (mean = 0, sd = 1 by default) and, for each deviate, multiply by 2 and add 64, with result vector col12. That 50% is done.

Next, we finish the second 50% (the if portion). We fill a vector with n/2 1s (col21) and another with random deviates and, for each deviate, multiply by 2.5 and add 69.

We now have 4 vectors that we'll turn into a matrix.

STEP 1: We glue col11 (filled with n/2 0s) and col21 (filled with n/2 1s) together using the c function to get a single vector of n elements.
STEP 2: Glue col12 and col22 together (filled with the deviates) using c to get another n-element vector. Note: the 0s/1s stay associated with the correct deviates from the 64/69 formulas.
STEP 3: Use cbind to make a matrix (xm) out of the vectors: the 0/1 vector becomes column 1, the deviate vector becomes column 2.
STEP 4: Get the number of rows in the matrix (which should just be n) using nrow.
STEP 5: Make a shuffled vector with all the row numbers randomly ordered using sample.
STEP 6: Make a new (unnamed) matrix putting xm's rows in order according to the shuffled vector.

The point of steps 4-6 is just to randomly order the rows, since the binomial process in sim1 would have produced a random order of rows.

This version runs 866% faster!

> system.time({ sim1(500000)})
   user  system elapsed 
  1.341   0.179   1.527 
> system.time({ sim2(500000)})
   user  system elapsed 
  0.145   0.011   0.158 

If you're concerned about proof that this maintains the integrity of the binomial process, consider that the binomial process does two things: 1) It associates 1 with the 2.5*d+69 equation and 0 with the 2*d + 64 equation - the association is maintained since rows are shuffled intact; 2) 50% go in the if clause and 50% in the else clause, as proved below.

sim3 <- function(n) {
    a <- 0
    for(j in 1:n) {
        if(runif(1) < 0.5) {
            a <- a + 1
        }
    }
    return(a/n)
}
> sim3(50)
[1] 0.46
> sim3(5000)
[1] 0.4926
> sim3(10000)
[1] 0.5022
> sim3(5000000)
[1] 0.4997844

The binomial process produces 50% 1s and 50% 0s (column 1).
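(For what it's worth, the sim3 check can itself be vectorized, in the same spirit as the answer; a one-line sketch:)

```r
# The mean of a logical vector is the proportion of TRUEs,
# so this estimates P(runif(1) < 0.5) in a single call.
mean(runif(5000000) < 0.5)  # approximately 0.5
```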

DarkerIvy
  • Did you read the question? This is homework, she's supposed to do some of it herself. – AkselA Dec 13 '17 at 00:45
  • @AkselA I'm not her nanny. Why she wants to know isn't my business. Answering the question is my objective. – DarkerIvy Dec 13 '17 at 00:48
  • Your function won't return a statistically identical result (or rather, only asymptotically so). Your function has a static split between the two distributions, in `sim1()` this split is the result of a binomial process. – AkselA Dec 13 '17 at 01:01
  • @AkselA Binomial process based on a *uniform* distribution. For any reasonable size of `n`, you'll get 50% split, which is identical to what you get with my version. Mine also maintains association between the 1s and the 69 equation and the 0s and the 64, preserving everything the binomial process provides. – DarkerIvy Dec 13 '17 at 01:15
  • Except stochasticity. – AkselA Dec 13 '17 at 01:27
  • @DarkerIvy hello thank you so much for your help; as I previously mentioned, this is my first time ever in a programming environment. I've written down your code and I will study it further to better understand how the loops were vectorized. Thank you. – Salma Yazmine Atiya Dec 13 '17 at 02:06
  • @SalmaYazmineAtiya I edited my response a little to make it clearer. If it still doesn't make sense, let me know. R does things in a beautiful way but it takes a minute to think that way. You'll get it! – DarkerIvy Dec 13 '17 at 14:20
  • @DarkerIvy Yes! The comments made it way better!! Thank you so much!! – Salma Yazmine Atiya Dec 13 '17 at 19:19
  • @SalmaYazmineAtiya If you like an answer, accept it as the solution. ;) – DarkerIvy Dec 13 '17 at 19:29

One optimization I can suggest is to create the matrix with a default value of 0. Once the matrix has been created with 0 as the default, there is no need to assign 0 inside the loop.

The modified code will look like:

sim1 <- function(n) {
    # create the matrix pre-filled with 0s
    xm <- matrix(0, nrow=n, ncol=2)
    for (i in 1:n) {
        d <- rnorm(1)
        if (runif(1) < 0.5) {
            xm[i,1] <- 1
            xm[i,2] <- 2.5*d + 69
        } else {
            # xm[i,1] <- 0    --- no longer needed
            xm[i,2] <- 2*d + 64
        }
    }
    return(xm)
}
MKR