Given the following data.table with financial data (35 million rows):
DT:
userId Date balance overdraft (boolean)
600 2014-11-01 -100 1
600 2014-11-02 1000 0
600 2014-11-03 -100 1
600 2014-11-04 -100 1
600 2014-11-05 100 0
600 2014-11-06 100 0
700 2014-11-01 -100 1
700 2014-11-02 1000 0
700 2014-11-03 -100 1
700 2014-11-04 -100 1
700 2014-11-05 -100 1
700 2014-11-06 100 0
The case:
a.- Total number of max. consecutive overdraft days by userId.
userId maxConsecutiveOverdraftDays
600 2
700 3
800 0
900 1
1000 5
In this case I done the following:
acum = FALSE
for (i in 1:nrow(DT)) {
if (DT[i]$overdraft == 1 ) {
if (acum == TRUE)
{
DT[i]$acumBalance <- DT[i]$balance + DT[i-1]$balance
DT[i]$totalConsecutiveOverdraftDays <- DT[i]$overdraft + DT[i-1]$overdraft
}
if (DT[i]$userId == DT[i+1]$userId
&& DT[i+1]$overdraft == 1 )
{
acum = TRUE
}
else { acum = FALSE }
}
}
DT[,maxConsecutiveOverdraftDays:=max(totalConsecutiveOverdraftDays),by=userId]
It takes more than 12 hours to finish.
How can I improve the code and reduce the computation time?
Thanks in advance.