0

I want to optimize the implementation of this formula.

Here is the formula: re[i] = x[i]^2 - x[i-1]*x[i+1]

x is an array of values. i goes from 1 to N, where N > 2400000. For i = 1, i-1 refers to the last element, and for i = N, i+1 refers to the first element. Here is the code I have written:

   x <- 1:2400000
   re <- array(data=NA, dim = NROW(x))
   lastIndex = NROW(x)
   for(i in 1:lastIndex){
      if (i==1) {
        re[i] = x[i]*x[i] - x[lastIndex]*x[i+1]
      } else if(i==lastIndex) {
        re[i] = x[i]*x[i] - x[i-1]*x[1]
      } else {
        re[i] = x[i]*x[i] - x[i-1]*x[i+1]  
      }
    }

Can it be done by apply in R?

smci
Globox

4 Answers

4

We can use direct vectorization for this:

# Make fake data
x <- 1:10
n <- length(x)
# create vectors for the plus/minus indices
xminus1 <- c(x[n], x[-n])
xplus1 <- c(x[-1], x[1])

# Use direct vectorization to get re
re <- x^2 - xminus1*xplus1
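
As a quick sanity check (a sketch, not part of the original answer), the vectorized result can be compared with the loop from the question on a small random vector. The helper names `loop_re` and `vec_re`, the seed, and the vector length are assumptions made for illustration.

```r
# Hypothetical helper: the loop from the question, wrapped in a function
loop_re <- function(x) {
  n <- length(x)
  re <- numeric(n)
  for (i in seq_len(n)) {
    prev <- if (i == 1) x[n] else x[i - 1]   # wrap around at the start
    nxt  <- if (i == n) x[1] else x[i + 1]   # wrap around at the end
    re[i] <- x[i]^2 - prev * nxt
  }
  re
}

# Hypothetical helper: the vectorized version shown above
vec_re <- function(x) {
  n <- length(x)
  x^2 - c(x[n], x[-n]) * c(x[-1], x[1])
}

set.seed(42)        # arbitrary seed, only for reproducibility
x <- runif(1000)    # arbitrary test data
all.equal(loop_re(x), vec_re(x))
```

If the two implementations agree, `all.equal()` returns `TRUE`.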
Dason
  • Awesome! Thanks Dason :) – Globox Apr 27 '17 at 18:14
  • This is creating three copies of a very large vector/array. You could avoid the copies with the padding trick, then (with N the original length) `x[2:(N+1)]^2 - x[1:N]*x[3:(N+2)]` – smci Apr 29 '17 at 00:00
  • @Chetan: this takes 3x the memory. If x is very large, then when you run out of memory, it will reduce the performance. – smci May 01 '17 at 22:32
1

If each x[i] really is equal to i, then you can do a little math:
x[i]^2 - (x[i]-1)*(x[i]+1) = 1
so every element of the result is 1, except the first and the last, where the wraparound applies.
The result is:

c(1-2*N, rep(1, N-2), N*N-(N-1))

In the general case (arbitrary values in x) you can do (as in the answer from Dason):

x*x - c(x[N], x[-N])*c(x[-1], x[1])
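
A small check, as a sketch with an arbitrarily chosen N, that the closed form and the general expression agree when x[i] = i:

```r
N <- 10     # small, arbitrary size for the check
x <- 1:N

closed  <- c(1 - 2*N, rep(1, N - 2), N*N - (N - 1))   # closed form for x[i] = i
general <- x*x - c(x[N], x[-N]) * c(x[-1], x[1])      # general expression

all.equal(closed, general)
```

Note that `1:N` is an integer vector, so for large N the products can exceed R's integer range and come back as NA with a warning; converting first with `as.numeric(x)` avoids this.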

Here is a solution with rollapply() from zoo:

library("zoo")
rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]) # or:
rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])

Here is the benchmark:

library("microbenchmark")
library("zoo")

N <- 10000
x <- 1:N

microbenchmark(
  math=c(1-2*N, rep(1, N-2), N*N-(N-1)), # for the data from the question
  vect.i=x*x - c(x[N], x[-N])*c(x[-1], x[1]), # general data
  roll.i=rollapply(c(x[length(x)],x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3]), # or:
  roll.tail=rollapply(c(tail(x,1), x, x[1]), width=3, function(x) x[2]^2 - x[1]*x[3])
)
# Unit: microseconds
#      expr       min         lq        mean     median         uq        max neval cld
#      math    33.613    34.4950    76.18809    36.9130    38.0355   2002.152   100  a 
#    vect.i   188.928   192.5315   732.50725   197.1955   198.5245  51649.652   100  a 
#    roll.i 56748.920 62217.2550 67666.66315 68195.5085 71214.9785 109195.049   100   b
# roll.tail 57661.835 63855.7060 68815.91001 67315.5425 71339.6045 119428.718   100   b
jogo
0

An lapply implementation of your formula would look like this:

x <- c(1:2400000) 
last <- length(x)

# iterate over the indices of x, not its values, so this also works for arbitrary data
re <- lapply(seq_along(x), function(i) {
    if(i == 1) {
        x[i]*x[i] - x[last]*x[i+1]
    } else if (i == last) {
        x[i]*x[i] - x[i-1]*x[1]
    } else {
        x[i]*x[i] - x[i-1]*x[i+1]  
    }
}) 

re <- unlist(re)

lapply returns a list, so the result is converted to a vector with unlist().
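
A variant of the same idea, as a sketch: `vapply` returns an atomic vector directly, so no `unlist()` is needed. The modular index arithmetic is an assumption of this sketch (it stands in for the if/else chain), and the short `x` is made-up test data.

```r
x <- c(3, 1, 4, 1, 5, 9)   # made-up test data
n <- length(x)

re <- vapply(seq_along(x), function(i) {
  prev <- x[(i - 2) %% n + 1]   # index i - 1, with 0 wrapping to n
  nxt  <- x[i %% n + 1]         # index i + 1, with n + 1 wrapping to 1
  x[i]^2 - prev * nxt
}, numeric(1))

re
```

This is still an R-level loop over every index, so for large inputs the vectorized approach is likely to remain much faster.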

andseven
  • Use `sapply` instead of `lapply`; it returns a vector/matrix rather than a list. Or even `vapply`, if you know the size and type of the output ahead of time – Parfait Apr 26 '17 at 00:48
0

1) You can avoid all the special-casing in the computation by padding the start and end of x with copies of its last and first elements; something like this:

N <- NROW(x)
x <- c(x[N], x, x[1]) # pad start and end to give wraparound; x now has N+2 elements

re <- lapply(2:(N+1), function(i) { x[i]*x[i] - x[i-1]*x[i+1] } )
#re <- unlist(re) as andbov wrote

# and remember not to use all of x, just x[2:(N+1)], elsewhere

2) Directly vectorize, as in @Dason's answer:

# Pad x as c(x[N], x, x[1]), with N the original length, then
x[2:(N+1)]^2 - x[1:N]*x[3:(N+2)]

3) If performance matters, I suspect that data.table or a plain for-loop over i will be faster, since each step references three consecutive rows.

4) For more performance, use byte-compiling.

5) If you need even more speed, use the Rcpp extension (C++ under the hood); see "How to use Rcpp to speed up a for loop?"

See those questions I cited for good examples of using lineprof and microbenchmarking to figure out where your bottleneck is.
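
As a rough sketch of point 4, a loop-based function can be byte-compiled with `cmpfun()` from the `compiler` package. The function name `re_loop` and the test data are assumptions for illustration. Since R 3.4.0 the JIT compiler is enabled by default, so the explicit step may make little difference on recent versions; benchmark before relying on it.

```r
library(compiler)   # ships with R

# Hypothetical loop-based function using the padding idea from point 1
re_loop <- function(x) {
  n <- length(x)
  xp <- c(x[n], x, x[1])          # pad both ends for wraparound
  re <- numeric(n)
  for (i in seq_len(n)) {
    re[i] <- xp[i + 1]^2 - xp[i] * xp[i + 2]
  }
  re
}

re_loop_cmp <- cmpfun(re_loop)    # byte-compiled copy of the same function

x <- as.numeric(1:10)             # small made-up input
re_loop_cmp(x)
```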

Community
smci
  • Inside `lapply` shouldn't it be `2:N` instead of `x[2:N]`? Also, this is not performance efficient, takes a lot of time to run. – Globox Apr 26 '17 at 18:17
  • I liked the padding part. Smart move :) – Globox Apr 26 '17 at 18:21
  • @Chetan: add some random-seeded data to your question details so we can actually run an apples-to-apples comparison. *"It takes a lot of time to run"* is not specific, nor can any of the rest of us verify it. As to N>2.4 million, pick an actual value. I assume you're not blowing out your memory-limit; if you are, all bets are off. – smci Apr 27 '17 at 13:58
  • Sure, 2:N, instead of x[2:N], whatever, the intent of the code is clear. – smci Apr 27 '17 at 14:17
  • To the downvoter: there's a lot of work went into this, so tell me what you think needs improving. – smci May 01 '17 at 22:34