How to write a cumulative calculation in data.table

Question

A sequential, cumulative calculation

I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.

A simplified example of the calculation looks like this, for each time step (row) i:

 v[i] <- a[i] + b[i] * v[i-1]

a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1) the initial value of v is taken as v0 = 0.

First attempt

My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is

library(data.table)        # version 1.9.7
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321) )
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321

DT[, v := NA]   # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4

This doesn't work, because shift(v) gives a copy of the original column v, shifted by 1 row. It is unaffected by assignment to v.

I also considered building the equation using cumsum() and cumprod(), but that won't work either.

Brute force approach

So I resort to a for loop inside a function for convenience:

vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA, length(a))      # initialize v
  for (i in 1:length(a)) {
    v[i] <- a[i] + b[i] * ifelse(i==1, v0, v[i-1])
  }
  return(v)
}

This cumulative function works fine with data.table:

DT[, v := vcalc(a, b, 0)][]
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE

My question

My question is, can I write this calculation in a more concise and efficient data.table way, without having to use the for loop and/or function definition? Using set() perhaps?

Or is there a better approach all together?

Edit: A better loop

David's Rcpp solution below inspired me to remove the ifelse() from the for loop:

vcalc2 <- function(a, b, v0 = 0) {
  v <- rep(NA, length(a))
  for (i in 1:length(a)) {
    v0 <- v[i] <- a[i] + b[i] * v0
  }
  return(v)
}

vcalc2() is 60% faster than vcalc().

Probably best to look for a closed-form solution in terms of v0, a, b. — Frank, Nov 03 '16 at 23:05
Is 'difference equation' the correct term? Anyway if it is, you can't vectorize the calculation as is, unless you find a closed-form solution. — smci, Nov 04 '16 at 12:04
Thanks for your comments. By closed-form I assume you mean an algebraic expression for the integral over time. That might be conceivable for a periodic function, like sines & cosines to represent the seasonal cycle, But not in general, where a and b are functions of highly variable environmental variables like precipitation, solar radiation, etc. I think you are right that vectorization is not possible. @Frank, I don't know if the term "difference equation" applies. — Douglas Clark, Nov 04 '16 at 13:06
I meant substituting in terms recursively until you have a function of a, b and v0. Nothing about the nature of the variables matters (though, yeah, a perfectly cyclical `a` might simplify it). You have an initial condition with some updating formula and so can express the value in terms of the initial condition and other data instead of doing it recursively. — Frank, Nov 04 '16 at 14:53
I'm used to seeing these sorts of problems simplify a lot after doing this, e.g., http://stackoverflow.com/q/38577232/ but that might not be the case for your problem. Here, I think I just works out to matrix algebra and a lot of redundant computations, like `m = matrix(1, .N+1L, .N); n = col(m) - row(m) + 1L; m[] = b^n*(n >= 0L); c(c(v0, a) %*% m)` but I still think it's a worthwhile standard exercise for this sort of problem... — Frank, Nov 04 '16 at 14:53

score 7 · Accepted Answer · answered Nov 03 '16 at 23:50

It may not be 100% what you are looking for, as it does not use the "data.table-way" and still uses a for-loop. However, this approach should be faster (I assume you want to use data.table and the data.table-way to speed up your code). I leverage Rcpp to write a short function called HydroFun, that can be used in R like any other function (you just need to source the function first). My gut-feeling tells me that the data.table way (if existent) is pretty complicated because you cannot compute a closed-form solution (but I may be wrong on this point...).

My approach looks like this:

The Rcpp function looks like this (in the file: hydrofun.cpp):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
  // get the size of the vectors
  int vecSize = a.length();

  // initialize a numeric vector "v" (for the result)
  NumericVector v(vecSize);

   // compute v_0
  v[0] = a[0] + b[0] * v0;

  // loop through the vector and compute the new value
  for (int i = 1; i < vecSize; ++i) {
    v[i] = a[i] + b[i] * v[i - 1];
  }
  return v;
}

To source and use the function in R you can do:

Rcpp::sourceCpp("hydrofun.cpp")

library(data.table)
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))

DT[, v_ans2 := HydroFun(a, b, 0)]
DT
# a   b v.ans v_ans2
# 1: 1 0.1 1.000  1.000
# 2: 2 0.1 2.100  2.100
# 3: 3 0.1 3.210  3.210
# 4: 4 0.1 4.321  4.321

Which gives the result you are looking for (at least from the value-perspective).

Comparing the speeds reveals a speed-up of roughly 65x.

library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
                 b = rnorm(n))

microbenchmark(dt[, v1 := vcalc(a, b, 0)],
               dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
# expr                                min        lq       mean    median         uq       max neval
# dt[, `:=`(v1, vcalc(a, b, 0))]    28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
# dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100

identical(dt$v1, dt$v2)
# [1] TRUE

Does that help you in any way?

Thanks David. I haven't used Rcpp yet. This will be fun to try. The speed up could be even more significant with the full form of model calculations. — Douglas Clark, Nov 04 '16 at 13:11
This is clearly the way to go. No point in wasting time trying to force a "native" R solution. — eddi, Nov 04 '16 at 17:56
At long last, I adopted the Rccp approach -- with quite a speedup. I also find Rccp::ccpFunction() to be very handy for prototyping and testing the cpp code inline rather than using a separate .cpp file. — Douglas Clark, Aug 30 '17 at 20:42
Oh yeah, Rcpp is wonderful, especially for code that is in the hotpath - speedups can be amazing. It gets a bit more complicated when using Rcpp in your own package, but there are many good vignettes that detail everything. — David, Aug 30 '17 at 20:43

score 2 · Answer 2 · edited May 23 '17 at 11:54

I think Reduce together with accumulate = TRUE is a commonly used technique for these types of calculations (see e.g. recursively using the output as an input for a function). It is not necessarily faster than a well-written loop*, and I don't know how data.table-esque you believe it is, still I want to suggest it for your toolbox.

DT[ , v := 0][
  , v := Reduce(f = function(v, i) a[i] + b[i] * v, x = .I[-1], init = a[1], accumulate = TRUE)]

DT
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321

Explanation:

Set initial value of v to 0 (v := 0). Use Reduce to apply function f on an integer vector of row numbers except the first row (x = .I[-1]). Instead add a[1] to the start of of x (init = a[1]). Reduce then "successively applies f to the elements [...] from left to right". The successive reduce combinations are "accumulated" (accumulate = TRUE).

*See e.g. here, where you also can read more about Reduce in this section.

Thanks for the suggestions. Reduce is new to me. I will take a look. — Douglas Clark, Nov 04 '16 at 13:13
You're calling `$.data.table` needlessly in a loop. Put the function inside the `[]` and get rid of the `DT$`'s. — eddi, Nov 04 '16 at 17:58
@eddi Thanks for your (highly relevant) comment! Updated accordingly. — Henrik, Nov 04 '16 at 18:09