
I have the following task to perform:

Generate 10^9 steps of the process described by the formula:

X(0)=0
X(t+1)=X(t)+Y(t)

where Y(t) are independent random variables with the distribution N(0,1). Calculate in what percentage of indices t the value of X(t) was negative.

I tried the following code:

  x<-c(0,0)
  z<-0
  loop<-10^9
  for(i in 2:loop) {
    x[1]<-x[2]
    x[2]<-x[1]+rnorm(1, 0, 1)
    if (x[2]<0) {z<-z+1}
  }

However, it is very slow. How can I speed it up?

wojciesz
  • I am confused why your for loop starts at 2 (2:loop) instead of 1:loop. Doesn't this execute the process only 10^9-1 times? – Uwe Dec 24 '17 at 03:59
  • @wojciesz: It is customary to accept an answer (click on the checkmark under the voting buttons) if you feel it has addressed your question. You have five answers to choose from here. If you feel none has addressed your question then please explain why in the comments or in an update to your question. – Claus Wilke Dec 27 '17 at 20:39

5 Answers


In general, for problems like this, you can translate your function one-to-one into C++ using the Rcpp package. This should give a considerable speedup.

First, the R version:

random_sum <- function(loop = 1000) {
  x<-c(0,0)
  z<-0
  for(i in 2:loop) {
    x[1]<-x[2]
    x[2]<-x[1]+rnorm(1, 0, 1)
    if (x[2]<0) {z<-z+1}
  }
  z / loop
}
set.seed(123)
random_sum()
# [1] 0.134

Now the C++ version:

library("Rcpp")
cppFunction("
  double random_sum_cpp(unsigned long loop = 1000) {
    double x1 = 0;
    double x2 = 0;
    double z = 0;
    for (unsigned long i = 2; i < loop; i++) {
      x1 = x2;
      x2 = x1 + Rcpp::rnorm(1)[0];
      if (x2 < 0) z = z+1;
    }
    return z/loop;
  }")

set.seed(123)
random_sum_cpp()
# [1] 0.134

For completeness, let's also consider the vectorized version that was proposed:

random_sum_vector <- function(loop = 1000) {
  Y = rnorm(loop)
  sum(cumsum(Y)<0)/loop
}
set.seed(123)
random_sum_vector()
# [1] 0.134

We see that it gives the same result for the same random seed, so it seems a viable contender.

In the benchmark, the C++ version and the vectorized version perform similarly, with the vectorized version displaying a slight edge over the C++ version:

> microbenchmark(random_sum(100000),
                 random_sum_vector(100000),
                 random_sum_cpp(100000))
Unit: milliseconds
                     expr        min         lq       mean     median         uq       max neval
        random_sum(1e+05) 184.205588 199.859266 209.220232 205.137043 211.026740 274.47615   100
 random_sum_vector(1e+05)   6.320690   6.631704   7.273645   6.799093   7.334733  18.48649   100
    random_sum_cpp(1e+05)   8.950091   9.362303  10.663295   9.956996  11.079513  21.30898   100

However, the vectorized version trades memory for speed and will blow up your memory for long runs. The C++ version uses virtually no memory.
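As a rough back-of-the-envelope estimate of that memory cost (my own numbers, assuming 8-byte doubles and 4-byte logicals as in a standard R build):

steps <- 1e9
steps * 8 / 2^30   # ~7.45 GiB for Y <- rnorm(steps), and the same again for cumsum(Y)
steps * 4 / 2^30   # ~3.7 GiB more for the logical vector cumsum(Y) < 0

So the single-shot vectorized call needs on the order of 15-20 GiB of RAM at 10^9 steps, which is why the constant-memory C++ loop (or a chunked variant of the vectorized code) is the safer choice at this scale.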

For 10^9 steps, the C++ version runs in about 2 minutes (110 seconds) on my machine. I didn't try the R version. Based on the shorter benchmarks, it would probably take around 7 hours.

> microbenchmark(random_sum_cpp(10^9), times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 random_sum_cpp(10^9) 110.2182 110.2182 110.2182 110.2182 110.2182 110.2182     1
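One further tweak that I have not benchmarked here (the name random_sum_cpp2 is just my own): inside the loop, Rcpp::rnorm(1)[0] allocates a length-one vector on every iteration. Rcpp also exposes scalar samplers under the R:: namespace, so R::rnorm(0, 1) draws a single value without that per-iteration allocation and may shave off a little more time. The logic is otherwise the same sketch as above.

library("Rcpp")
cppFunction("
  double random_sum_cpp2(unsigned long loop = 1000) {
    double x = 0;   // current value of the walk
    double z = 0;   // count of negative values
    for (unsigned long i = 2; i < loop; i++) {
      x += R::rnorm(0, 1);   // scalar draw, no temporary vector
      if (x < 0) z = z + 1;
    }
    return z / loop;
  }")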
Claus Wilke

This should be much faster, but a billion of anything may take a while. It might be good to test this out with smaller values of length first, like 10^6.

length = 10^9
Y = rnorm(length)
sum(cumsum(Y)<0)/length

EDIT

Based on the comments of @user3666197, I tested this and he was correct. This solution works well for smaller numbers, but once the number of steps gets too large, it fails.

I tested my "vectorized" version against the OP's code. When the length of the random walk was 10^8, my code took about 7 seconds and the OP's code took 131 seconds (on my laptop). However, when I increased the length to 10^9 (as per the original question), my version caused a lot of disk swapping and I had to kill the process. This solution fails at the scale requested by the OP.

G5W
  • With all due respect, testing on a series of only about 1E+6 values is **fundamentally wrong if one is trying to seriously benchmark and compare execution performance** -- the CPU L3 cache can still retain the whole series. Rigorous testing requires deliberately injecting cache de-coherence steps, so that the measured execution times are not skewed by in-cache runs and remain representative of real-world execution with non-cached memory fetches (which the 1E+9 sizes always exercise). – user3666197 Dec 23 '17 at 15:46
  • Your proposed approach may gain a slight performance advantage in the [TIME]-domain, but only by a trick that comes at an incredible cost in the [SPACE]-domain. It will stop working once the allocation no longer fits into the available free RAM, and will literally commit suicide if allowed to trigger uncontrollable memory swapping. One might be pleasantly surprised that a **"naive" sequential process is principally immune to this risk of being suffocated to death by memory swapping.** – user3666197 Dec 23 '17 at 15:57
  • And the "naive" sequential process is much faster anyways (~200x), if you don't implement the loop in R. – Claus Wilke Dec 23 '17 at 16:31
  • You are right. My version thrashes at 10^9. Instead of deleting, I will modify as a warning. – G5W Dec 23 '17 at 17:00

One solution is to go with the vectorized approach proposed by @G5W, but to break it into smaller chunks to avoid memory overflow issues. This gives you the speed of the vectorized solution, and by managing the chunk size you can control how much memory the process uses.

The following breaks the problem into blocks of 1e+07, and by looping 100 times you get a total of 1e+09.

At the end of each block, you record the percentage of time spent below 0 and the ending point of the walk; the ending point is then fed into the next block as its starting offset.

At the end, average the 100 runs to get the overall percentage of time below zero. The call to cat in the while loop is only there to monitor progress and can be commented out.

funky <- function(start, length = 1e+07) {
  Y <- rnorm(length)   # increments for this chunk
  Z <- cumsum(Y)       # walk within this chunk, relative to its starting point
  # Return the fraction of this chunk spent below zero (i.e. start + Z < 0)
  # and the walk's value at the end of the chunk, to seed the next call.
  c(sum(Z < (-start)) / length, (tail(Z, 1) + start))
}

starttime <- Sys.time()
resvect <- vector(mode = "numeric", length = 100)
result <- funky(0)
resvect[1] <- result[1]
i <- 2
while (i < 101) {
  cat(result, "\n")
  result <- funky(result[2])
  resvect[i] <- result[1]
  i <- i + 1
}
mean(resvect)
# [1] 0.1880392
endtime <- Sys.time()
elapsed <- endtime - starttime
elapsed
# Time difference of 1.207566 mins
Eric Watt

On otherwise deterministic hardware, a source of randomness is implemented as a pseudo-random generator (PRG) algorithm that must satisfy two requirements at once: repeatability of the generated stream and the statistical properties expected of "generated" randomness. Such a source is not easily converted from a pure-[SERIAL] mode of operation into any form of "just"-[CONCURRENT] or true-[PARALLEL] operation.

This said, the PRG step is the central (blocking) point for any attempt to restructure the pure-[SERIAL] code execution.

This does not change the percentage of (non-)negative X(t) values; it just means that, for a given PRG implementation, there is no shorter way than the pure-[SERIAL] generation of the mutually (serially) dependent values.

Unrolling the "slow" loop, or quasi-vectorised processing (quasi, because the values are still serially dependent: R's vectorised primitives mostly harness CPU-instruction-level tricks, so they are not a game-changer for the language but a way of circumventing some knowingly slow code-execution constructs), is the most one can expect to happen.
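
A small sketch of that last point (my own illustration): with R's default Mersenne-Twister / Inversion settings, the same seed yields the same stream whether the values are drawn one at a time or in a single vectorized call, so vectorization only sidesteps the R-level loop overhead, not the serial nature of the stream itself.

set.seed(42)
a <- replicate(5, rnorm(1))   # five serial single draws
set.seed(42)
b <- rnorm(5)                 # one vectorized call
identical(a, b)               # TRUE: same serial stream either way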

user3666197

Using vectors will generally yield better performance than for loops. The issue with very large numbers (e.g. 10^9) is running into memory limits. Since you are only interested in the final percentage of indices that are negative, the following works; it takes a couple of minutes at 10^9 steps.

update_state <- function (curr_state, step_size) {
  # Draw the next chunk of at most step_size increments.
  n <- min(curr_state$counter, step_size)
  r <- rnorm(n)
  # Walk values within this chunk, offset by the running sum so far.
  total <- curr_state$cum_sum + cumsum(r)

  list('counter' = curr_state$counter - n,                  # steps still to simulate
       'neg_count' = curr_state$neg_count + sum(total < 0), # negative indices so far
       'cum_sum' = curr_state$cum_sum + sum(r))             # walk value at end of chunk
}


n <- 10^9
curr_state <- list('counter' = n, 'neg_count' = 0, 'cum_sum' = 0)

step_size <- 10^8
while (curr_state$counter > 0) {
  curr_state <- update_state(curr_state = curr_state, step_size = step_size)
}

print(curr_state)
print(curr_state$neg_count / n)
AlphaDrivers