
I have a `rollapply` function that does something very simple, but over a million data points this simple function is quite slow. I would like to know if it is possible to tell `rollapply` how to update the window incrementally on each slide, rather than recomputing the function over the full window every time.

Concretely, I am performing a rolling-window computation for basic statistical anomaly detection.

The rollapply function:

minmax <- function(x) { max(x) - min(x) }

invoked by:

mclapply(data[,eval(vars),with=F], 
         function(x) rollapply(x,width=winSize,FUN=minmax,fill=NA),
         mc.cores=8)

where `data` is an 8-column data.table and `winSize` is 300.

This call takes about 2 minutes on 8 cores and is one of the major bottlenecks in the overall computation. However, I can imagine keeping the window's values sorted (by value and index) and then doing O(log n) comparisons on each slide instead of rescanning all 300 values.
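
For illustration, here is a minimal sketch of that sorted-window idea in plain R (`roll_minmax_sorted` is a hypothetical helper, not an existing function; note that R's vector insert/delete is still O(n) per step, so the real payoff of this approach would come in compiled code):

# Hypothetical sketch: keep the current window sorted, so min/max are the
# first/last elements and each slide only does two binary searches.
roll_minmax_sorted <- function(x, width) {
  stopifnot(width <= length(x))
  n <- length(x)
  out <- rep(NA_real_, n)
  offset <- (width - 1) %/% 2              # center alignment, like rollapply
  win <- sort(x[1:width])                  # initial sorted window
  out[offset + 1] <- win[width] - win[1]
  for (i in seq_len(n - width)) {
    win <- win[-findInterval(x[i], win)]   # drop outgoing value (binary search)
    j <- findInterval(x[i + width], win)   # locate slot for incoming value
    win <- append(win, x[i + width], after = j)
    out[offset + 1 + i] <- win[width] - win[1]
  }
  out
}

With center alignment this should agree with `rollapply(x, width, FUN=minmax, fill=NA)`, though in pure R the copying usually makes it slower than rollapply rather than faster.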

However, I often see posts suggesting moving away from for loops toward lapply. What is the next logical step to optimize further?

channon
  • If the posts you're referring to (suggesting `lapply` over loops) are a few years old, they might say that because `for` loops used to be slower. That has since been fixed (R 3.0?), so the premise that `for` loops are always *slower* is perhaps ill-advised. There are times when it makes more sense *not* to use a `for` loop (when you don't want side effects and want to capture the result of each iteration), in which case one of the `*apply` functions may make more sense. – r2evans Feb 07 '19 at 20:04
  • When using parallel functions like `mclapply`, my biggest concern would be how much data is being passed between the cores. For instance, if you go from `mc.cores=8` to (say) `=4`, does your execution time double? What about `=12` (if you have more)? If you do not see clearly-improving performance with more cores, you might be experiencing lag due to copying data around (depending on your parallel setup). So perhaps speeding this up is not just related to your UDF (which is about as efficient as you can get in its raw form). – r2evans Feb 07 '19 at 20:08
  • Since I am new to R, I did not know about the performance changes in for loops. I will see how a sorted-window for loop performs relative to rollapply. 8 cores was chosen consciously, as there are 8 rollapplys (and 8 cores). I'm going to move to a new machine with 36 cores, and maybe then I can investigate the Amdahl's Law effects, after I figure out remote R in ESS. Thanks for the advice. – channon Feb 07 '19 at 20:19
  • Remote R in ESS: depends on what you need. For me, I have emacs/ess installed on the remote system, so `ssh` and `tmux` (`screen`) suffice; when I need to see plots, I use [`rmote`](https://github.com/cloudyr/rmote). I've read about [ESS-Remote](https://ess.r-project.org/Manual/ess.html#ESS_002dremote) but have never tried it. – r2evans Feb 07 '19 at 20:27
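
A hedged sketch of the core-scaling test suggested in the comments above (the 8-column data frame here is synthetic stand-in data, and `mclapply` forks, so this only applies on a Unix-alike):

library(parallel)
library(zoo)

minmax <- function(x) max(x) - min(x)
fake <- as.data.frame(replicate(8, runif(1e5)))   # stand-in for the real data.table

# If elapsed time doesn't fall as cores increase, the cost of shipping data
# to the workers dominates, not the rolling computation itself.
for (nc in c(1, 2, 4, 8)) {
  t <- system.time(
    mclapply(fake, function(x) rollapply(x, width = 300, FUN = minmax, fill = NA),
             mc.cores = nc)
  )
  cat(nc, "cores:", round(t["elapsed"], 1), "s\n")
}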

2 Answers


Not sure if/how this would apply in the mclapply environment, but you can gain a little speedup by employing zoo's optimized rollmax function. Since zoo doesn't provide a complementary rollmin, you'll need to adapt: the rolling min of `x` is the negated rolling max of `-x`.

minmax <- function(x) max(x) - min(x)
aa <- runif(1e4)
identical(
  zoo::rollapply(aa, width=100, FUN=minmax, fill=NA),
  zoo::rollmax(aa, k=100, fill=NA) + zoo::rollmax(-aa, k=100, fill=NA)
)
# [1] TRUE

microbenchmark::microbenchmark(
  minmax = zoo::rollapply(aa, width=100, FUN=minmax, fill=NA),
  dblmax = zoo::rollmax(aa, k=100, fill=NA) + zoo::rollmax(-aa, k=100, fill=NA)
)
# Unit: milliseconds
#    expr     min      lq     mean   median      uq      max neval
#  minmax 70.7426 76.0469 84.81481 77.99565 81.8047 148.8431   100
#  dblmax 15.6755 17.4501 19.09820 17.93665 18.8650  52.4849   100

(The improvement will depend on the window size, so your results may vary, but an optimized function like `zoo::rollmax` will almost always out-perform calling a UDF on every window.)
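
For instance, one way to see how the gap scales with window size (timings omitted here, since they depend on the machine):

for (w in c(10, 100, 1000)) {
  print(microbenchmark::microbenchmark(
    minmax = zoo::rollapply(aa, width=w, FUN=minmax, fill=NA),
    dblmax = zoo::rollmax(aa, k=w, fill=NA) + zoo::rollmax(-aa, k=w, fill=NA),
    times = 10
  ))
}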

r2evans

If you really want to edge out as much performance as you can, use Rcpp. Custom loops are a great use case for C++, especially when your function is pretty simple.

First the results, then the code:

microbenchmark::microbenchmark(
  minmax = zoo::rollapply(aa, width=width, FUN=minmax, fill=NA),
  dblmax = zoo::rollmax(aa, k=width, fill=NA) + zoo::rollmax(-aa, k=width, fill=NA),
  cminmax = crollapply(aa, width=width), times = 10
)
# Unit: milliseconds
#     expr       min         lq       mean    median         uq        max neval cld
#   minmax 154.04630 162.728871 188.198416 173.13427 200.928005 298.568673    10   c
#   dblmax  37.38127  38.541603  44.818505  41.42796  50.001888  61.024250    10  b
#  cminmax   2.31766   2.363676   2.406835   2.39237   2.438109   2.512162    10 a

C++/Rcpp code:

#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// [[Rcpp::export]]
std::vector<double> crollapply(std::vector<double> aa, int width) {
  if(width > (int)aa.size()) throw exception("width too large :(");
  int n = aa.size();
  // Center alignment, matching zoo::rollapply's default: the window for
  // output index i spans [i - start_offset, i + back_offset], length = width.
  int start_offset = (width-1) / 2;
  int back_offset = width / 2;
  std::vector<double> results(n);
  int i=0;
  for(; i < start_offset; i++) {    // leading fill=NA
    results[i] = NA_REAL;
  }
  for(; i < n - back_offset; i++) {
    // Still rescans the full window each step (O(n * width), like rollapply);
    // the speedup comes from removing R's per-window call overhead.
    const double* lo = aa.data() + (i - start_offset);
    double min = *std::min_element(lo, lo + width);
    double max = *std::max_element(lo, lo + width);
    results[i] = max - min;
  }
  for(; i < n; i++) {               // trailing fill=NA
    results[i] = NA_REAL;
  }
  return results;
}

R code:

library(dplyr)
library(zoo)
library(microbenchmark)
library(Rcpp)

sourceCpp("~/Desktop/temp.cpp")

minmax <- function(x) max(x) - min(x)
aa <- runif(1e4)
width <- 100
x1 <- zoo::rollapply(aa, width=width, FUN=minmax, fill=NA)
x3 <- crollapply(aa, width=width)
identical(x1,x3)

width <- 101
x1 <- zoo::rollapply(aa, width=width, FUN=minmax, fill=NA)
x3 <- crollapply(aa, width=width)
identical(x1,x3)

microbenchmark::microbenchmark(
  minmax = zoo::rollapply(aa, width=width, FUN=minmax, fill=NA),
  dblmax = zoo::rollmax(aa, k=width, fill=NA) + zoo::rollmax(-aa, k=width, fill=NA),
  cminmax = crollapply(aa, width=width), times = 10
)
thc
  • To really edge out as much performance as you can, would it help to remove the `std::` bindings and do it low-level brute-force? (That may be getting ridiculous ... your speed-ups here are impressive.) – r2evans Feb 07 '19 at 21:54
  • I'm not a C++ guru, so that was not really a suggestion so much as another question, @thc – r2evans Feb 07 '19 at 22:38
  • No, `std` means the standard C++ library. It is supposed to be extremely well optimized, so use it whenever you can ;) – thc Feb 07 '19 at 23:30
  • I was sufficiently aware of what it *was*, but I have no experience with how good it is overall, so thank you for that reassurance. My experience with Rcpp sugar is that it is good and makes code more maintainable and readable, but it does have a *slight* performance penalty, so I projected that degradation onto `std::` as well. Thanks! – r2evans Feb 08 '19 at 00:08