data.table - efficiently manipulate large data set

Question

I am amazed by the blazing speed of data.table. The coding below does exactly what I need however when executed on a large table it does not perform very well.

convinced that this can be done faster with data.table but I do not see how.

Output

The output needs to be a matrix with the rownames a regular sequence of days. For each column separately:

All values before the first value need to be NA
All values after the last value need to be NA
Between the first and the last value 0 need to be added as the do not exist in the input table

The following coding shows how the result should look like:

M <- 
  matrix(c(NA, NA, NA, 2, 0, 1, 3, 0, 2 , NA,
           NA, NA, 3,  1, 3, 2, 1, 2, NA, NA), 
           ncol = 2, 
           dimnames = list(as.character((Sys.Date() + 0:9)),
                           c("E1", "E2")))

Output example

##            E1 E2
## 2017-01-27 NA NA
## 2017-01-28 NA NA
## 2017-01-29 NA  2
## 2017-01-30  2  2
## 2017-01-31  0  2
## 2017-02-01  3  1
## 2017-02-02  1  3
## 2017-02-03  0  3
## 2017-02-04  2 NA
## 2017-02-05 NA NA

Input

The following table shows the source/input for the coding/function:

DS <- data.table(
  E = c(rep("E1", 4), rep("E2", 6)),
  C = c(c(Sys.Date() + c(3, 5, 6, 8)),
        c(Sys.Date() + c(2, 3, 4, 5, 6, 7))),
  S = round(runif(n = 10,min = 1, max = 3), 0),
  key = c("E", "C"))

##      E          C S
##  1: E1 2017-01-30 3
##  2: E1 2017-02-01 1
##  3: E1 2017-02-02 2
##  4: E1 2017-02-04 1
##  5: E2 2017-01-29 3
##  6: E2 2017-01-30 2
##  7: E2 2017-01-31 3
##  8: E2 2017-02-01 1
##  9: E2 2017-02-02 2
## 10: E2 2017-02-03 3

Input example

Code working

The following few lines do exactly what I need and is simple. However it is not efficient.
The real table has 700 unique C values and 2 Million E values.

# Create the regular time line per day
CL <- c(C= (Sys.Date() + 0:9))

# Determine first and last per E
DM <- DS[, .(MIN = min(C), MAX = max(C)), by =.(E)]

# Generate all combinations 
CJ <- CJ(E = DS$E, C = CL, unique = TRUE)

# Join 
DC <- DS[CJ, on = .(E, C)][!is.na(E)]

# replace NA by 0
DC[is.na(S), S:=0]

# Lead-in
DC[DM, on=.(E, C<MIN), S:=NA]

# Lead-out
DC[DM, on=.(E, C>MAX), S:=NA]

# Cast to matrix format
DC2 <- dcast(
  data = DC, formula = C ~ E, 
  fun.aggregate = sum, value.var = "S")

# coerce to matrix
M3 <- as.matrix(DC2[, -1])

# add row nanes
rownames(M3) <- format(CL, "%Y-%m-%d")

I made some long, un-readable, clumsy coding which creates the matrix with 1.2B cells in 35 secs. This must be possible as quick but far more elegant with data.table, however not like this.

"The end result should be a matrix" -- probably this means that you should be using `reshape2::acast`, which returns a matrix or array. Personally, I find your example too complicated to examine closely. Generally, *minimal* reproducible examples are easier for others to work with. See [mcve] for guidance. — Frank, Jan 24 '17 at 22:33
Hi frank, thanks. This coding should be a simple copy and paste and it will work, but I can imagine that it is too much. Can't do it shorter. That is also the reason i would like to use dcast to get the same result (probably) in combination with a cross-Join (CJ). will try acast to turn the data.table into a matrix. This picture should explain everything, thanks anyway — Floris Padt, Jan 25 '17 at 20:21
I think you should write your example as a function of `n`, number of rows, and `nc`, number of unique `C` values. With such a small example, there's no real way to benchmark alternative code to see if it's any better than yours. — Frank, Jan 27 '17 at 21:36

mnel · Answer 1 · 2017-01-24T23:03:00.570

A data.table, like a data.frame underneath everything is a list (with length = number of columns)

200 Million columns is a lot of columns - this will make anything slow.

The description of the conversion to "wide" will bloat the data with large number of NA values. You can almost certainly perform the analysis you need on the "long form" and using keys.

It isn't clear from your question what you need, but you can calculate the various sums

# convert to an IDate
DT[, CALDAY := as.IDate(CALDAY)]
# get range of dates
rangeDays <- DT[,range(CALDAY)]

all_days <- as.IDate(seq(rangeDays[1],rangeDays[2], by=1)) 
# create sums
DT_sum <- DT[, list(VALUE= sum(VALUE)), keyby = list(ENTITY, CALDAY)]

and then index using entity and dates.

 DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days)]

and if you need to replace NA with 0

na_replace <- function(x,repl=0){x[is.na(x)]<-repl;x}

DT_sum[.("2a8605e2-e283-11e6-a3bb-bbe3fd226f8d", all_days), na_replace(VALUE)]

Floris Padt · Answer 2 · 2017-01-28T22:05:04.720

This does the trick. But still the performance is not good.
It takes DS as input parameter. The result is a data.table which should be coerced to matrix by:

M <- as.matrix(build_timeseries_DT(DS))

Function

build_timeseries_DT <- function(DS){

  # regular time serie for complete range with index
  dtC <- data.table(
    CAL = seq(min(DS$C), max(DS$C), by = "day"))[, idx:= 1:.N]

  # add row index (idx) to sales
  DQ <- dtC[DS, on = "CAL"]
  setkey(DQ, "ENT")

  # calculate min index per ENT
  DM <- DQ[, .(MIN = min(idx), MAX = max(idx)), by = .(ENT)]

  # allocate memory, assign 0 and set rownames by reference
  DT <- dtC[, .(CAL)][, (DM[, ENT]):= 0L][, CAL:= NULL]
  setattr(DT, "row.names", format(dtC$CAL, "%Y-%m-%d"))

  # Set NA for the Lead-in and out, next populate values by ref
  for(j in colnames(DT)){
    set(x     = DT, 
        i     = c(1L:(DM[j, MIN]), (DM[j, MAX]):DT[, .N]), 
        j     = j, 
        value = NA )
    set(x     = DT, 
        i     = DQ[j, idx], 
        j     = j, 
        value = DQ[j, SLS] )}

  return(DT)
}

Test Data

DS <- data.table(
  ENT = c("A", "A", "A", "B", "B", "C", "C", "C", "D", "D"),
  CAL = c(Sys.Date() + c(0, 5, 6, 3, 8, 1, 2, 9, 3, 5)),
  SLS = as.integer(c(1, 2, 1, 2, 3, 1, 2, 3, 2, 1)),
  key = c("ENT", "CAL"))

   ENT        CAL SLS
 1:   A 2017-01-28   1
 2:   A 2017-02-02   2
 3:   A 2017-02-03   1
 4:   B 2017-01-31   2
 5:   B 2017-02-05   3
 6:   C 2017-01-29   1
 7:   C 2017-01-30   2
 8:   C 2017-02-06   3
 9:   D 2017-01-31   2
10:   D 2017-02-02   1

Result

as.matrix(build_timeseries_DT(DS))

      A  B  C  D
 [1,]  1 NA NA NA
 [2,]  0 NA  1 NA
 [3,]  0 NA  2 NA
 [4,]  0  2  0  2
 [5,]  0  0  0  0
 [6,]  2  0  0  1
 [7,]  1  0  0 NA
 [8,] NA  0  0 NA
 [9,] NA  3  0 NA
[10,] NA NA  3 NA

result with colors