3

I would like to calculate a rolling quantile using data table, which contains data on several groups, for each group i have multiple days and within each day I have multiple observations. I do not want to calculate the rolling quantile for every single observation in the table, but always take data for the last, say, 3 days, calculate a quantile and move on.

I have data like this:

test2 <- data.table(group = rep(c("a", "b"), each = 100),
                    date = rep(rep(seq(from = as.Date('2017-01-01'),
                                    as.Date('2017-01-10'),
                                    by = "day"), each = 10), 2),
                    time = rep(rep(seq(from = 1, 10, by = 1), times = 10), 2),
                    some_data = rnorm(200) + c(1:20, 20:1, 30:1, 1:30, 30:1, 1:20, 20:1, 1:30))

The table above has 2 different groups, each group has 10 days of data, within each day there is 10 observations. Next, I want to calculate the rolling quantiles for each group.

In theory, I could calculate it like this:

tests_result <- test2[, list(date = date,
                         q_30 = rollapply(some_data,
                                          30, quantile,
                                          probs = 0.3,
                                          fill = NA, align = "right")),
                  by = "group"][seq(from = 10, to = 200, by = 10)]

But this would be painfully slow even if I used c++ based code to calculate the quantiles! So what I really want is to return the result only for the last observation in the day.

Problem 2) I have varying number of observations within each day, but still would like to be able to calculate the rolling quantile for specific number of days (as opposed to setting fixed width of the window for which to evaluate the quantile).

The output in the example scenario should look like this:

    group       date      q_30
 1:     a 2017-01-01        NA
 2:     a 2017-01-02        NA
 3:     a 2017-01-03 10.284081
 4:     a 2017-01-04  8.281827
 5:     a 2017-01-05  8.281827
 6:     a 2017-01-06  8.281827
 7:     a 2017-01-07 10.274793
 8:     a 2017-01-08  4.749455
 9:     a 2017-01-09  4.749455
10:     a 2017-01-10  9.246267
11:     b 2017-01-01        NA
12:     b 2017-01-02        NA
13:     b 2017-01-03 10.145996
14:     b 2017-01-04  5.423782
15:     b 2017-01-05  5.423782
16:     b 2017-01-06  9.741683
17:     b 2017-01-07 10.123940
18:     b 2017-01-08  4.347293
19:     b 2017-01-09  4.347293
20:     b 2017-01-10  9.177718

Summarizing the challenges:

  1. calculate the quantiles only once for each day instead of 10 times
  2. perform the quantile calculation for given number of days even when there is different number of observations in each day. i.e, if I wanted to calculate quantile based on 2 days, first day would have 10 values, 2nd day 20 values, I would get a result based on the 30 values in those two days and the final result of this calculation would be assigned to the date of the second day.

EDIT:

I figured a way how to process the dataset with the size I have. But I think it can be still much improved, so, please, if you have any suggestions, i would like to hear them.

My approach on the sample dataset would look like this:

First calculate the total number of observations in each subsequent 3 days, also calculate the row number within original dataset of the last observation in a given day. These new variables will be called in_3 and orig_row.

test3 <- test2[, list(.N, orig_row = .I[.N]), by = c("group", "date")][, list(date,in_3 = rollapply(N, 3, sum, fill = NA, align = "right"),
                                                  orig_row),
                                           by = "group"]

then use foreach package to iterate over the relevant subsets of the original data frame and calculate the quantile for each of them.

library(foreach)
quantiles <- foreach(i = 1:nrow(test3),.combine = 'c') %do% 
  ifelse(!is.na(test3[i]$V3 - test3[i]$in_3 + 1),
  test2[(test3[i]$orig_row - 3[i]$in_3 + 1):test3[i]$orig_row][,quantile(some_data, probs = 0.3)],
  NA)

And finally, assign to the aggregated dataset

test3[, `:=`(q03 = quantiles)]

I tried to run this also in parallel, but then my laptop runs out of physical memory and starts writing to disk too much, which slows the process down much more then crunching with one core only.

ira
  • 2,542
  • 2
  • 22
  • 36
  • What's your expected ouptut? And what do you mean by rolling quantile for specific number of days? That's not what your "in theory" code does. – mtoto Nov 25 '17 at 10:18
  • That's why after the code I say `and then take the last observation for each day`. I expect to have for each date in the dataset one number, which would be a quantile calculated from observations in current day, + observations in `n` previous days. – ira Nov 25 '17 at 10:23
  • Aaah, you are right, that the code I provided as an example what i want, actually does not do exactly what I thought. Adjustment will come in a moment – ira Nov 25 '17 at 10:31
  • @mtoto my bad, only now the expected result is correct. – ira Nov 25 '17 at 12:45

0 Answers0