Most efficient/vectorization when using previous calculated value (rolling)

Question

Following those conversations:

Can I vectorize a calculation which depends on previous elements
sapply? tapply? ddply? dataframe variable based on rolling index of previous values of another variable

I wanted to test a more "real-life" case-study. I recently had to migrate SAS code to R and kdb code to R code. I tried to compiled a simple enough yet more sophisticated example to optimize.

let's build the training set

buildDF <- function(N){
    set.seed(123); dateTimes <- sort(as.POSIXct("2001-01-01 08:30:00") + floor(3600*runif(N)));
    set.seed(124); f <- floor(1+3*runif(N));
    set.seed(123); s <- floor(1+3*runif(N));
    return(data.frame(dateTime=dateTimes, f=f, s=s));
}

This is what need to be achieved

f1 <- function(DF){
    #init
    N <- nrow(DF);
    DF$num[1] = 1;

    for(i in 2:N){
        if(DF$f[i] == 2){
            DF$num[i] <- ifelse(DF$s[i-1] == DF$s[i],DF$num[i-1],1+DF$num[i-1]);        
        }else{ #meaning f in {1,3}
            if(DF$f[i-1] != 2){
                DF$num[i] = DF$num[i-1]; 
            }else{
                DF$num[i] = ifelse((DF$dateTime[i]-DF$dateTime[i-1])==0,DF$num[i-1],1+DF$num[i-1]);
            }
        }
    }
    return(DF)
}

This is off course hideous. Let's vectorize it a bit:

f2 <- function(DF){
    N <- nrow(DF);
    DF$add <- 1; DF$ds <- c(NA,diff(DF$s)); DF$lf <- c(NA,DF$f[1:(N-1)]);
    DF$dt <- c(NA,diff(DF$dateTime));
    DF$add[DF$f == 2 & DF$ds == 0] <- 0;
    DF$add[DF$f == 2 & DF$ds != 0] <- 1;
    DF$add[DF$f != 2 & DF$lf != 2] <- 0;
    DF$add[DF$f != 2 & DF$lf == 2 & DF$dt==0] <- 0;
    DF$num <- cumsum(DF$add);
    return(DF);
}

And using the most useful data.table:

f3 <- function(DT){
    N <- nrow(DT);
    DT[,add:=1]; DT[,ds:=c(NA,diff(s))]; DT[,lf:=c(NA,f[1:(N-1)])];
    DT[,dt:=c(NA,diff(dateTime))];
    DT[f == 2 & ds == 0, add:=0];
    DT[f == 2 & ds != 0, add:=1];
    DT[f != 2 & lf != 2, add:=0];
    DT[f != 2 & lf == 2 & dt == 0, add:=0];
    DT[,num:=cumsum(add)];
    return(DT);
}

On a 10K data-frame:

library(rbenchmark);
library(data.table);

N <- 1e4;
DF <- buildDF(N)
DT <- as.data.table(DF);#we can contruct the data.table as a data.frame so it's ok we don't count for this time.

#make sure everybody is equal
DF1 <- f1(DF) ; DF2 <- f2(DF); DT3 <- f3(DT);
identical(DF1$num,DF2$num,DT3$num) 
[1] TRUE

#let's benchmark
benchmark(f1(DF),f2(DF),f3(DT),columns=c("test", "replications", "elapsed",
+ "relative", "user.self", "sys.self"), order="relative",replications=1);
    test replications elapsed relative user.self sys.self
2 f2(DF)            1   0.010      1.0     0.012    0.000
3 f3(DT)            1   0.012      1.2     0.012    0.000
1 f1(DF)            1   9.085    908.5     8.980    0.072

Ok, now on a more decent 5M rows data.frame

N <- 5e6;
DF <- buildDF(N)
DT <- as.data.table(DF);
benchmark(f2(DF),f3(DT),columns=c("test", "replications", "elapsed",       
+ "relative", "user.self", "sys.self"), order="relative",replications=1);
    test replications elapsed relative user.self sys.self
2 f3(DT)            1   2.843    1.000     2.092    0.624
1 f2(DF)            1  10.920    3.841     4.016    5.137

We gain 5X with data.table.

I wonder if Rcpp or zoo:::rollapply can gain much on this. I would be happy with any suggestion

Not sure what you mean... but I thought you would hit me on the `i` statement which does not use the `.J` — statquant, Dec 30 '12 at 18:36
Using Rcpp will make your loops much faster -- as shown in the first question you yourself reference in the question. So why don't *you* try that approach? SO is full of answers demonstrating the real and tangible gains people have gotten from using Rcpp. — Dirk Eddelbuettel, Dec 30 '12 at 18:41
Dirk I am trying to write the Rcpp code but I have to confess that I am having difficulties to find examples (I installed RcppExamples but it looks empty to me :)) — statquant, Dec 30 '12 at 18:43
`N <- 1e2; DF <- buildDF(N); DT <- as.data.table(DF); DF1 <- f1(DF); DF2 <- f2(DF); DT3 <- f3(DT); identical(DF1$num,DF2$num,DT3$num)` returns `[1] FALSE` — redmode, Dec 30 '12 at 20:41
@statquant, what don't you see about my comment? Look at the help page for data.table by typing ?data.table and read about the roll argument. Search the help page for the string "roll". And let us know if it helps or not. — Matt Dowle, Dec 30 '12 at 20:54
You surely can't be having difficulties finding _any_ Rcpp examples are you? Are you subscribed to R-bloggers RSS feed? Lots of Rcpp articles there, 100 packages using it. It's hard to miss them! Given the other good questions you've asked, you'll be needing to use R and C together and Rcpp will be worth learning. See its integration with RStudio, for example. — Matt Dowle, Dec 30 '12 at 21:15
Of course not... I am having difficulties to find an example with a data.frame being modified (what I need to answer the question). I did not subscribe to that feed (did not know it existed). About RStudio I am using vim-R-plugin which I find way better. Finally I always pay a lot of attention into making sure I do more than sufficient research on any post, judging by the number of post I found on people judging that Rcpp has insufficient "very basic level example" (I do not mean to criticize), I think I can't be blame... — statquant, Dec 30 '12 at 21:23
@redmode: Thanks I think It is good now (no pb for 1e4 but 1e2... strange) — statquant, Dec 30 '12 at 22:35
@statquant: In previous edition it was an issue with `difftime` dynamic units. When 1e4 difference between `dateTime` values all in seconds, but with smaller sample there can be either seconds or minutes. Just try `N<-1e2; df<-buildDF(N)` and then `df$dateTime[6]-df$dateTime[5]` along with `df$dateTime[7]-df$dateTime[6]`. Also run `diff(df$dateTime)`. — redmode, Dec 31 '12 at 01:46

redmode · Accepted Answer · 2012-12-31T13:56:44.997

Simple inline Rcpp version:

library(Rcpp)
library(inline)

f4cxx <- cxxfunction(signature(input="data.frame"), plugin="Rcpp", body='
  Rcpp::DataFrame df(input);
  const int N = df.nrows();  

  Rcpp::NumericVector f = df["f"];
  Rcpp::NumericVector s = df["s"];
  Rcpp::NumericVector d = df["dateTime"]; // As far as we need only comparation
  Rcpp::NumericVector num(N);             // it is safe to convert Datetime to Numeric (faster) 

  num[0] = 1;
  for(int i=1; i<N; i++){
    bool cond1 = (f[i]==2) && (s[i]!=s[i-1]);
    bool cond2 = (f[i]!=2) && (f[i-1]==2) && (d[i]!=d[i-1]);
    num[i] = (cond1 || cond2)?1+num[i-1]:num[i-1];    
  }

  df["num"] = num;
  return df;                                // Returns list
  //return (Rcpp::as<Rcpp::DataFrame>(df)); // Returns data.frame (slower)
  ')

Checking out:

N<-1e4; df<-buildDF(N)
identical(f1(df)$num, f4cxx(df)$num)

[1] TRUE

Benchmarking:

N<-1e5; df<-buildDF(N); dt<-as.data.table(df)
benchmark(f2(df), f2j(df), f3(dt), f4cxx(df),
          columns=c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
          order="relative", replications=1);

       test replications elapsed relative user.self sys.self
4 f4cxx(df)            1   0.001        1     0.000        0
2   f2j(df)            1   0.037       37     0.040        0
3    f3(dt)            1   0.058       58     0.056        0
1    f2(df)            1   0.078       78     0.076        0

Can you note which version of Rcpp you used? Also, it's not really an accurate comparison if `f4cxx` returns a list and the others return a data.frame. The Rcpp function is still 2-4x faster when it returns a data.frame. — Joshua Ulrich, Dec 31 '12 at 04:32
Nice !! Many thanks, I have some stuff to process here ! Thanks and happy new year — statquant, Dec 31 '12 at 07:54
@JoshuaUlrich, initially I used 0.10.1. After I updated to 0.10.2 I gained even greater speed up. Thank you! — redmode, Dec 31 '12 at 11:15
@JoshuaUlrich: yes, returning `list` instead of `data.frame` is dirty trick when benchmarking :) The intention was to show potential source of speed up. — redmode, Dec 31 '12 at 11:17
The issue is that we lose attributes with `List::operator[]`, so the things that make a data frame a data frame are lost. — Romain Francois, Jan 02 '13 at 08:50

score 3 · Answer 2 · answered Dec 30 '12 at 23:55

3

Converting your first function to C/C++ using Rcpp (or the "plain old C API" if you want to make Dirk scratch his head) would very likely be the fastest. A data.table solution would probably be a close second.

Here's a base R solution that's much faster than your f2 function because it avoids a lot of data.frame subsetting (which is very slow). This illustrates what to do/avoid in order to make base R code fast, but at some cost of code clarity/readability.

f2j <- function(DF){
  N <- nrow(DF)
  f2  <- DF$f == 2
  ds0 <- diff(DF$s) == 0
  lf2 <- f2[-N]
  f2  <- f2[-1]
  dt3 <- diff(DF$dateTime) == 0
  cond <- logical(N)
  cond[-1] <- (f2 & ds0) | (!f2 & !lf2) | (!f2 & lf2 & dt3)
  DF$num <- cumsum(!cond)
  DF
}

answered Dec 30 '12 at 23:55

Joshua Ulrich

173,410
32
338
418

However, `N<-1e4; df<-buildDF(N); identical(f1(df)$num, f2j(df)$num)` returns `[1] FALSE` – redmode Dec 31 '12 at 01:39
@redmode, it just needs to be coerced to `numeric` if that's really desired. `all.equal(f1(df)$num, f2j(df)$num)` returns `[1] TRUE`, and so does `identical(f1(df)$num, as.numeric(f2j(df)$num))` – GSee Dec 31 '12 at 03:36
@redmode: yes, I did that on purpose because integer vectors are faster than numeric vectors and double precision isn't needed in this function. – Joshua Ulrich Dec 31 '12 at 04:14

Most efficient/vectorization when using previous calculated value (rolling)

2 Answers2

Linked