Day by day I am more impressed by the capabilities of data.table, and I am extremely grateful to the developers for it! I recently wrote some code that works just fine for most datasets, but I would like opinions on how to modify it to save memory and make it more efficient for really big data (> 500 million rows).
In a nutshell, I am working with time series by individual (in R jargon: by group), where both the length of the time series and the number of groups are very large (length > 5k and number of groups > 500k). My aim is to create a script that loads a portion of the time series (id, date, value), extracts intervals from it based on another dataset (id, start_date, end_date), and evaluates a function on each extracted interval (e.g. the mean). I have been working on this on and off since last year (see my earlier question here) and I achieved pretty good timings on very large datasets.
Right now I would like to extend the code to handle more general situations, while staying within the data.table framework as much as possible. For instance, I want to apply the function not only to a single interval per id but to multiple lagged intervals (specifying the number of lags and the time window), which is the topic of this question. I also want the code to work when an individual time series is split across datasets (not the topic of this question).
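To fix ideas, the two inputs have roughly this structure (a tiny toy sketch with hypothetical column names, just to illustrate the shapes; the real tables are of course much larger):

library(data.table)

# Per-id time series: one row per (id, date)
ts_dt  <- data.table(id    = rep(1:2, each = 4),
                     date  = rep(seq(as.IDate("2021-01-01"), as.IDate("2021-01-04"), "days"), 2),
                     value = rnorm(8))

# Intervals per id over which a function (e.g. the mean of value) is evaluated
int_dt <- data.table(id         = 1:2,
                     start_date = as.IDate(c("2021-01-01", "2021-01-02")),
                     end_date   = as.IDate(c("2021-01-03", "2021-01-04")))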
As a working example, I am using a slightly modified version of the data from this tutorial:
library(data.table)

set.seed(123)

# Red Sox home schedule (the double inclusion of Cleveland is on purpose)
fenway <- fread("Visitor, Start, End
Cleveland, 2021-09-03, 2021-09-12
Cleveland, 2021-09-26, 2021-10-04
Baltimore, 2021-09-17, 2021-09-26")

# I create location-specific time series for a month that at least partially
# overlaps with the intervals of dates defined in fenway, and I randomly include some NAs
indmiss <- rbinom(60, size = 1, prob = 0.90)
rain <- runif(60)
rain[which(indmiss == 0)] <- NA
september <- data.table(Visitor = rep(c("Cleveland", "Baltimore"), each = 30),
                        date = rep(seq(as.IDate('2021-09-01'), as.IDate('2021-09-30'),
                                       'days'), 2),
                        rain = rain)
As in that example, I want to perform a non-equi join, but in addition to matching the dates that fall inside each interval, I also want to create, during the join, a new column called group (which represents a lag column). This column is then used in a chained statement to define two new columns, n and mean.
# I define the unit (the width, in days, of each lag window), which I will need
# in the next step, and I create a copy of date to use as the non-equi join column
unit <- 3
september[, date_m := date]

# My statement
home_games <- september[fenway,
                        .(Visitor, End, rain,
                          group = floor(as.numeric(End - date) / unit)),
                        on = .(Visitor = Visitor, date_m >= Start, date_m <= End),
                        by = .EACHI][,
                          .(Visitor, End, n = length(na.omit(rain)),
                            mean = mean(rain, na.rm = TRUE)),
                          by = c("Visitor", "End", "group")]
The output is correct and here it is:
      Visitor        End group n      mean
 1: Cleveland 2021-09-12     3 1 0.3839696
 2: Cleveland 2021-09-12     2 2 0.3614500
 3: Cleveland 2021-09-12     1 3 0.8055987
 4: Cleveland 2021-09-12     0 2 0.5345264
 5: Cleveland 2021-10-04     2 3 0.7709669
 6: Cleveland 2021-10-04     1 2 0.5307609
 7: Baltimore 2021-09-26     3 1 0.9144382
 8: Baltimore 2021-09-26     2 3 0.3888398
 9: Baltimore 2021-09-26     1 3 0.4324164
10: Baltimore 2021-09-26     0 3 0.6035392
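To make the meaning of group explicit, here is a quick check of the binning for the first interval, using the same unit = 3 and End = 2021-09-12 as above: group 0 covers the last three days of the interval, group 1 the three days before those, and so on, counting backwards from End.

# Days between End and each date in the interval, binned into 3-day lag windows
dts <- seq(as.IDate("2021-09-03"), as.IDate("2021-09-12"), "days")
floor(as.numeric(as.IDate("2021-09-12") - dts) / unit)
# [1] 3 2 2 2 1 1 1 0 0 0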
I realized that this method works fine unless you have an enormous amount of data (e.g. 500 million rows in the time series). In that case, I get this error:
Error in `[.data.table`(september, fenway, .(Visitor, End, rain, group = floor((as.numeric(End - :
negative length vectors are not allowed
This makes me think it is a memory problem. If that is the case, I thought the issue might be in the chaining, so embedding the entire computation in a single statement could be a solution.
So I tried to define a function that computes group and then creates n and mean:
foo <- function(end, dat, exp) {
  # Build a small table with the lag group and the exposure values,
  # then aggregate it with rollup()
  out <- setDT(list(grp = floor(as.numeric(end - dat) / unit),
                    exp = exp))
  return(as.list(rollup(out,
                        j = c(list(n = length(na.omit(exp)), mean = mean(exp))),
                        by = c("end", "grp"))))
}
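For context, this is how I understand rollup() to behave on a small stand-alone table (a toy example, not my real data): it returns the aggregates for every group plus a grand-total row where the grouping column is NA, and that is the behaviour I am trying to reuse inside the join.

toy <- data.table(g = c("a", "a", "b"), v = c(1, 2, 10))
rollup(toy, j = list(n = .N, mean = mean(v)), by = "g")
# expected: one row per group (g = "a": n = 2, mean = 1.5; g = "b": n = 1, mean = 10)
# plus a total row with g = NA (n = 3, mean ~ 4.33)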
I then included foo in the statement like this:
home_games <- september[fenway,
                        .(Visitor, unlist(foo(end = setDT(mget("i.End")),
                                              dat = setDT(mget("date")),
                                              exp = setDT(mget("rain"))))),
                        on = .(Visitor = Visitor, date_m >= Start, date_m <= End),
                        by = .EACHI]
But I get the following error:
Error in Ops.data.frame(fine, dat) :
‘-’ only defined for equally-sized data frames
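I suspect (but am not sure) that end and dat end up being one-column data.tables with different numbers of rows inside each group, and subtracting data frames of different sizes produces exactly this kind of error. A minimal stand-alone illustration, unrelated to my real data:

# '-' dispatches to Ops.data.frame, which requires equally-sized data frames
data.frame(x = 1) - data.frame(x = 1:3)
# Error in Ops.data.frame(...) : ‘-’ only defined for equally-sized data frames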
I am not at all sure that I wrote this correctly, but it was really hard for me to find documentation on this. Other questions helped, but I cannot find a way out. On the other hand, given that the chained statement worked, I believe there might be a solution.
Thanks everyone!