
Day by day I am more impressed by the capabilities of data.table, and I am extremely grateful to the developers for it! I recently wrote some code that works just fine for most datasets, but I would like opinions on how to modify it to save memory and make it more efficient for really big data (> 500 million rows).

In a nutshell, I am working with time series by individual (in R jargon: by group), where both the length of the time series and the number of groups are very large (length > 5k and number of groups > 500k). My aim is to create a script that loads a portion of the time series (id, date, value), extracts intervals from it based on another dataset (id, start_date, end_date) and evaluates a function on each extracted interval (e.g. mean). I have been working on this on and off since last year (see question here) and I achieved pretty good time performance on very large datasets.

Right now I would like to extend my code to handle more general situations, while staying within the data.table framework as much as possible. For instance, I want to apply this function not only to a single interval per id, but to multiple lagged intervals (specifying the number of lags and the time window), and this is the topic of the question. I also want to allow this code to work even if an individual time series is split across datasets (but that is not the topic of the question).

In this case, I am using as a working example a slightly modified version of the data used in this tutorial:

library(data.table)
set.seed(123)

# Red Sox home schedule (the double inclusion of Cleveland is on purpose)
fenway <- fread("Visitor, Start, End
        Cleveland, 2021-09-03, 2021-09-12
        Cleveland, 2021-09-26, 2021-10-04
        Baltimore, 2021-09-17, 2021-09-26")

# I create location-specific time series for a month that at least partially
# overlaps with the interval of dates defined in fenway, and I also randomly include some NAs.
indmiss <- rbinom(60, size = 1, prob = 0.90)
rain <- runif(60)
rain[which(indmiss == 0)] <- NA
september <- data.table(Visitor = rep(c("Cleveland", "Baltimore"), each = 30),
                        date = rep(seq(as.IDate('2021-09-01'), as.IDate('2021-09-30'), 'days'), 2),
                        rain = rain)

As in that example, I want to perform a non-equi join but, in addition to matching by dates included in the intervals, I also want to run a function during the join that creates a new column called group (which represents a lag column). This column is then used in a chained statement to define two new columns, n and mean.

# I define a unit which I will need in the next step, and I create a new variable for the date
unit <- 3
september[, date_m := date]

# My statement
home_games <- september[fenway,
  .(Visitor, End, rain, group = floor(as.numeric(End - date) / unit)),
  on = .(Visitor = Visitor, date_m >= Start, date_m <= End), by = .EACHI][,
  .(Visitor, End, n = length(na.omit(rain)), mean = mean(rain, na.rm = TRUE)),
  by = c("Visitor", "End", "group")]

The output is correct and here it is:

     Visitor        End group n      mean
 1: Cleveland 2021-09-12     3 1 0.3839696
 2: Cleveland 2021-09-12     2 2 0.3614500
 3: Cleveland 2021-09-12     1 3 0.8055987
 4: Cleveland 2021-09-12     0 2 0.5345264
 5: Cleveland 2021-10-04     2 3 0.7709669
 6: Cleveland 2021-10-04     1 2 0.5307609
 7: Baltimore 2021-09-26     3 1 0.9144382
 8: Baltimore 2021-09-26     2 3 0.3888398
 9: Baltimore 2021-09-26     1 3 0.4324164
10: Baltimore 2021-09-26     0 3 0.6035392
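
Just to make the logic clearer, here is the first link of the chain run on its own (the object name joined is only for illustration); it shows the intermediate result of the non-equi join, including the group column, before the grouped aggregation:

# Intermediate result of the non-equi join; the chained statement above then
# aggregates this by Visitor, End and group.
joined <- september[fenway,
                    .(Visitor, End, rain, group = floor(as.numeric(End - date) / unit)),
                    on = .(Visitor = Visitor, date_m >= Start, date_m <= End),
                    by = .EACHI]
head(joined)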

I realized that this method works fine unless you have an enormous amount of data (e.g. 500 million rows for the time series). In that case, I get this error:

Error in `[.data.table`(september, fenway, .(Visitor, End, rain, group = floor((as.numeric(End -  : 
  negative length vectors are not allowed
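
My guess (not something the error message states) is that an intermediate vector created by the join exceeds R's 2^31 - 1 element limit. A quick way to check how many rows the join would produce, without materializing the joined table, is to count matches per interval:

# Number of rows the non-equi join would return, counted per row of fenway with .N
# and summed; no joined table is built, so this stays cheap on memory.
# nomatch = NULL drops intervals with no matching dates from the count.
n_match <- september[fenway, .N,
                     on = .(Visitor = Visitor, date_m >= Start, date_m <= End),
                     by = .EACHI, nomatch = NULL][, sum(as.numeric(N))]
n_match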

Either way, it seems to be a memory problem. If that is the case, I thought that maybe the issue lies in the chaining, so embedding the entire computation in a single statement could be a solution. So I tried to define a function that computes group and subsequently creates n and mean:

foo <- function(end, dat, exp){
  out <- setDT(list(grp = floor(as.numeric(end - dat) / unit),
                    exp = exp))
  return(as.list(rollup(out, j = c(list(n = length(na.omit(exp)), mean = mean(exp))),
                        by = c("end", "grp"))))
}

and I included it in the statement like this:

home_games <- september[fenway,
                        .(Visitor, unlist(foo(end = setDT(mget("i.End")),
                                              dat = setDT(mget("date")),
                                              exp = setDT(mget("rain"))))),
                        on = .(Visitor = Visitor, date_m >= Start, date_m <= End),
                        by = .EACHI]

But I get the following error:

Error in Ops.data.frame(fine, dat) : 
‘-’ only defined for equally-sized data frames 

I am not at all sure that I wrote my code correctly, but it was really hard for me to find any documentation on this. Other questions helped, but I cannot find a way out. On the other hand, considering that the chained statement worked, I believe there might be a solution.

Thanks everyone!

jmarkov
1 Answer


I think this is likely a memory issue, and you could have duplicates in your by columns. It's difficult to test since obviously I don't have the full dataset, but I will break up your statement to show where duplicates could exist and what to do.

I suggest wrapping Visitor and End with unique() in that last portion. Otherwise, I believe that it will not be grouping correctly.

home_games <- september[
  fenway, 
  .(Visitor, 
    End, 
    rain, 
    group=floor((as.numeric(End-date)/unit))
  ),
  on=.(
    Visitor=Visitor,
    date_m>=Start, # here you are creating two columns named 'date_m'; is this intentional?
    date_m<=End
  ), 
  by=.EACHI
  ][,
    # 'Visitor' and 'End' will not return what you're expecting since they need to be unique per group
    .(Visitor = unique(Visitor),
      End = unique(End),
      n = length(na.omit(rain)),
      mean = mean(rain, na.rm=T)), 
    by=c("Visitor","End","group")
    ]

      Visitor        End group   Visitor        End n      mean
 1: Cleveland 2021-09-12     3 Cleveland 2021-09-12 1 0.3839696
 2: Cleveland 2021-09-12     2 Cleveland 2021-09-12 2 0.3614500
 3: Cleveland 2021-09-12     1 Cleveland 2021-09-12 3 0.8055987
 4: Cleveland 2021-09-12     0 Cleveland 2021-09-12 2 0.5345264
 5: Cleveland 2021-10-04     2 Cleveland 2021-10-04 3 0.7709669
 6: Cleveland 2021-10-04     1 Cleveland 2021-10-04 2 0.5307609
 7: Baltimore 2021-09-26     3 Baltimore 2021-09-26 1 0.9144382
 8: Baltimore 2021-09-26     2 Baltimore 2021-09-26 3 0.3888398
 9: Baltimore 2021-09-26     1 Baltimore 2021-09-26 3 0.4324164
10: Baltimore 2021-09-26     0 Baltimore 2021-09-26 3 0.6035392
Jamie
  • Thanks @Jamie for answering and sorry for the late reply. It seems to me that your results are the same as the ones I get without using unique(). Let me know if there are other reasons I am not understanding. – jmarkov Dec 08 '22 at 07:04
  • hmm it's hard to say without having the full dataset, but I'm leaning towards this being a memory issue. Are you able to partition the data by team or even league and then rbind the results together? (a sketch of this idea follows the comments) – Jamie Dec 08 '22 at 17:47
  • Yes I can do that. I reached a point where I realized it would be better to split the data by month and merge them together afterwards weighting each monthly average. The code is work in progress, but I think it would be very useful for people who work on massive time series datasets. In my mind it would be very nice to share it with everyone and get feedbacks but I am not sure this question would be the right spot to do it. – jmarkov Dec 09 '22 at 02:58
  • You were right, without specifying `unique()` I would have gotten the wrong results. Sometimes `data.table` is still a mystery to me. – jmarkov Mar 02 '23 at 04:15
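
Here is a rough sketch of the partition-and-rbind idea from the comments (illustration only, not from the thread and not tested on the full data): run the join one team at a time and stack the per-team results with rbindlist(). The aggregation keeps Visitor and End only in by, to avoid the duplicated columns discussed in the answer.

home_games_split <- rbindlist(lapply(unique(fenway$Visitor), function(v) {
  # Subset both tables to a single team, run the same non-equi join and
  # grouped aggregation, and return the per-team result; peak memory is
  # then bounded by the largest team rather than the whole dataset.
  september[Visitor == v][fenway[Visitor == v],
    .(Visitor, End, rain, group = floor(as.numeric(End - date) / unit)),
    on = .(Visitor = Visitor, date_m >= Start, date_m <= End), by = .EACHI][,
    .(n = length(na.omit(rain)), mean = mean(rain, na.rm = TRUE)),
    by = c("Visitor", "End", "group")]
}))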