R data.table optimising. Overdraft financial data

Question

Given the following data.table with financial data (35 million rows):

DT:

userId  Date         balance  overdraft (boolean)
600     2014-11-01   -100     1
600     2014-11-02   1000     0
600     2014-11-03   -100     1
600     2014-11-04   -100     1
600     2014-11-05   100      0
600     2014-11-06   100      0
700     2014-11-01   -100     1
700     2014-11-02   1000     0
700     2014-11-03   -100     1
700     2014-11-04   -100     1
700     2014-11-05   -100     1
700     2014-11-06   100      0

The case:

a.- Total number of max. consecutive overdraft days by userId.

userId  maxConsecutiveOverdraftDays   
600     2
700     3
800     0
900     1
1000    5

In this case I done the following:

acum = FALSE

for (i in 1:nrow(DT)) {

    if (DT[i]$overdraft == 1 ) {
      if (acum == TRUE) 
  {

        DT[i]$acumBalance <- DT[i]$balance + DT[i-1]$balance 
        DT[i]$totalConsecutiveOverdraftDays   <- DT[i]$overdraft + DT[i-1]$overdraft
  }

      if (DT[i]$userId == DT[i+1]$userId 
      && DT[i+1]$overdraft == 1 ) 
  {

        acum = TRUE
  }  
    else { acum = FALSE }
}
}

DT[,maxConsecutiveOverdraftDays:=max(totalConsecutiveOverdraftDays),by=userId]

It takes more than 12 hours to finish.

How can I improve the code and reduce the computation time?

Thanks in advance.

score 2 · Accepted Answer · answered Dec 15 '14 at 23:29

2

Can't say if this will help your performance problem, but rle is helpful here for nice short code. Since the value of overdraft is always zero or one, we can take the max of a product of length and value:

> aggregate(overdraft~userId, df, FUN=function(x) {
+   r <- rle(x)
+   max(r$lengths * r$values)
+ })
  userId overdraft
1    600         2
2    700         3

answered Dec 15 '14 at 23:29

Neal Fultz

9,282
1
39
60

`data.table` adaptation: `DT[, rle(overdraft), by=userId][, max(lengths * values), by=userId]` – BrodieG Dec 15 '14 at 23:52
proc.time() - timer user system elapsed 6.90 0.28 7.18 – José Enrique Pérez Fernández Dec 16 '14 at 09:58
12 hours to 7 seconds? Whoa. – Neal Fultz Dec 16 '14 at 17:36

Arun · Answer 2 · 2015-01-07T19:28:05.717

Update: #686 is now implemented in 1.9.5. From NEWS:

7) rleid(), a convenience function for generating a run-length type id column to be used in grouping operations is now implemented. Closes #686. Check ?rleid examples section for usage scenarios.

With this function, we can do:

require(data.table) ## 1.9.5+
dt[, .(overdraft = overdraft[1L], .N), by=.(userId, 
      rleid(overdraft))][overdraft == 1L, max(N), by=userId]

There's a FR #686 open to implement such a function. But this'll have to be implemented along with fast rank function. We have not gotten to it yet.

Until then, you can do it this way:

dt[, .(overdraft = overdraft[1L], .N), by=.(userId, 
    cumsum(c(TRUE, diff(overdraft) != 0L)))][overdraft == 1L, max(N), by=userId]
#    userId V1
# 1:    600  2
# 2:    700  3

Edit: Corrected logic as pointed out by @Dirk. Thanks!

proc.time() - timer user system elapsed 6.90 0.28 7.18 – José Enrique Pérez Fernández Dec 16 '14 at 09:58 — José Enrique Pérez Fernández, Dec 16 '14 at 09:58

Dirk · Answer 3 · 2014-12-18T23:15:35.270

This one-liner should do the trick:

overdraft[, daysoverdraft:=seq(.N)-1, by=cumsum(overdraft == 0)]

Let's test with a simple set of data:

overdraft = data.table(userId=rep(1:10^3, each=350),
                       date=rep(1:350, 10^3),
                       balance=round(rnorm(35*10^3)*100),
                       overdraft=0)
overdraft[balance < 0, overdraft:=1]

With 350k rows it runs acceptable fast (~1.2 seconds on my laptop), however, not even close to the speed of Arun's answer, which is at least 20x faster.

R data.table optimising. Overdraft financial data

3 Answers3

Update: #686 is now implemented in 1.9.5. From NEWS: