
I would like to create a rolling two-quarter average for alpha, bravo and charlie (and lots of other variables). My research keeps leading me to the zoo and lubridate packages, but the examples always seem to come back to rolling within a single variable or grouping.

set.seed(123)

dates <-  c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16", "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")

df <- data.frame(dates = sample(dates, 100,  replace = TRUE, prob=rep(c(.03,.07,.03,.08, .05),2)), 
                           alpha = rnorm(100, 5), bravo = rnorm(100, 10), charlie = rnorm(100, 15))

I'm looking for something like

x <- df %>% mutate_if(is.numeric, funs(rollmean(., 2, align='right', fill=NA)))

Desired result: a weighted average across "Q4'15" & "Q1'16", then "Q1'16" & "Q2'16", and so on, for each column of data (alpha, bravo, charlie). I am not looking for the average of the paired quarterly averages.

Here is what the averages would be for the "Q4'15" & "Q1'16" time point:

df %>% filter(dates %in% c("Q4'15", "Q1'16")) %>%  select(-dates) %>% summarise_all(mean)
zx8754
Michael Bellhouse

  • It would be great if the example were a bit smaller and had an expected output – akrun Jun 07 '18 at 02:06
  • tried to clarify desired output in question – Michael Bellhouse Jun 07 '18 at 02:13
  • It looks like you could easily do this with a loop, but you don't want to use a loop? – Feng Jiang Jun 07 '18 at 02:27
  • It would be better to keep your example data short. Also, since you are using `dates` to compute the rolling two-quarter average, it makes sense to keep the `dates` column as actual dates instead of a `factor`. Please update the example accordingly. – MKR Jun 07 '18 at 06:12
  • done. @Jfly thanks, that is a viable strategy, though I would prefer to add onto a dplyr chain if possible. Thanks all for looking – Michael Bellhouse Jun 08 '18 at 02:02

1 Answer


I like data.table for this, and I have a solution for you, though there may be a more elegant one. Here is what I have:

Data

Now as data.table:

R> suppressMessages(library(data.table))
R> set.seed(123)
R> datesvec <- c("Q4'15", "Q1'16", "Q2'16","Q3'16", "Q4'16",
+               "Q1'17", "Q2'17" ,"Q3'17", "Q4'17","Q1'18")
R> df <- data.table(dates = sample(datesvec, 100,  replace = TRUE,
+                                 prob=rep(c(.03,.07,.03,.08, .05),2)),
+                  alpha = rnorm(100, 5),
+                  bravo = rnorm(100, 10),
+                  charlie = rnorm(100, 15))
R> df[ , ind := which(datesvec==dates), by=dates]
R> setkey(df, ind)  # optional but may as well
R> head(df)
   dates   alpha    bravo charlie ind
1: Q4'15 5.37964 11.05271 14.4789   1
2: Q4'15 7.05008 10.36896 15.0892   1
3: Q4'15 4.29080 12.12845 13.6047   1
4: Q4'15 5.00576  8.93667 13.3325   1
5: Q4'15 3.53936  9.81707 13.6360   1
6: Q1'16 3.45125 10.56299 16.0808   2
R> 

The key here is that we need to restore / maintain the temporal ordering of your quarters which your data representation does not have.
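As an aside, the same index can be recovered via a factor lookup instead of `which()`. This is just a sketch of an equivalent construction, assuming the `df` and `datesvec` objects created above:

```r
## Equivalent index via a factor whose levels follow the quarter sequence;
## as.integer() then returns each row's position in that sequence
df[, ind2 := as.integer(factor(dates, levels = datesvec))]
df[, all(ind == ind2)]    # should be TRUE
```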

Average by quarter

This is easy with data.table:

R> ndf <- df[ ,
+           .(qtr=head(dates,1),          # label of quarter
+             sa=sum(alpha),              # sum of a in quarter
+             sb=sum(bravo),              # sum of b in quarter
+             sc=sum(charlie),            # sum of c in quarter
+             n=.N),                      # number of observations
+           by=ind]
R> ndf
    ind   qtr      sa       sb       sc  n
 1:   1 Q4'15 25.2656  52.3039  70.1413  5
 2:   2 Q1'16 65.8562 132.6650 192.7921 13
 3:   3 Q2'16 10.3422  17.8061  31.3404  2
 4:   4 Q3'16 84.6664 168.1914 256.9010 17
 5:   5 Q4'16 41.3268  87.8253 139.5873  9
 6:   6 Q1'17 42.6196  85.4059 134.8205  9
 7:   7 Q2'17 76.5190 162.0784 241.2597 16
 8:   8 Q3'17 42.8254  83.2483 127.2600  8
 9:   9 Q4'17 68.1357 133.5794 198.1920 13
10:  10 Q1'18 37.0685  78.4107 120.2808  8
R> 

Lag those averages once

R> ndf[, `:=`(psa=shift(sa),               # previous sum of a
+            psb=shift(sb),               # previous sum of b
+            psc=shift(sc),               # previous sum of c
+            pn=shift(n))]                # previous nb of obs
R> ndf
    ind   qtr      sa       sb       sc  n     psa      psb      psc pn
 1:   1 Q4'15 25.2656  52.3039  70.1413  5      NA       NA       NA NA
 2:   2 Q1'16 65.8562 132.6650 192.7921 13 25.2656  52.3039  70.1413  5
 3:   3 Q2'16 10.3422  17.8061  31.3404  2 65.8562 132.6650 192.7921 13
 4:   4 Q3'16 84.6664 168.1914 256.9010 17 10.3422  17.8061  31.3404  2
 5:   5 Q4'16 41.3268  87.8253 139.5873  9 84.6664 168.1914 256.9010 17
 6:   6 Q1'17 42.6196  85.4059 134.8205  9 41.3268  87.8253 139.5873  9
 7:   7 Q2'17 76.5190 162.0784 241.2597 16 42.6196  85.4059 134.8205  9
 8:   8 Q3'17 42.8254  83.2483 127.2600  8 76.5190 162.0784 241.2597 16
 9:   9 Q4'17 68.1357 133.5794 198.1920 13 42.8254  83.2483 127.2600  8
10:  10 Q1'18 37.0685  78.4107 120.2808  8 68.1357 133.5794 198.1920 13
R> 

Average over current and previous quarter

R> ndf[is.finite(psa),                     # where we have valid data
+     `:=`(ra=(sa+psa)/(n+pn),            # total sum / total n == avg
+          rb=(sb+psb)/(n+pn),
+          rc=(sc+psc)/(n+pn))]
R> ndf[,c(1:2, 11:13)]
    ind   qtr      ra       rb      rc
 1:   1 Q4'15      NA       NA      NA
 2:   2 Q1'16 5.06233 10.27605 14.6074
 3:   3 Q2'16 5.07989 10.03141 14.9422
 4:   4 Q3'16 5.00045  9.78935 15.1706
 5:   5 Q4'16 4.84589  9.84680 15.2496
 6:   6 Q1'17 4.66369  9.62395 15.2449
 7:   7 Q2'17 4.76554  9.89937 15.0432
 8:   8 Q3'17 4.97268 10.22195 15.3550
 9:   9 Q4'17 5.28386 10.32513 15.4977
10:  10 Q1'18 5.00972 10.09476 15.1654
R> 

This takes advantage of the fact that the total sum over two quarters divided by the total number of observations is the same as the mean over the pooled observations of those two quarters. (And this reflects an edit following an earlier thinko of mine.)
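A tiny standalone illustration of that identity, with made-up numbers independent of the data above:

```r
## Pooled mean vs mean-of-means, with unequal group sizes
a <- c(1, 2, 3)    # "quarter 1": n = 3, mean = 2
b <- c(10, 20)     # "quarter 2": n = 2, mean = 15
mean(c(a, b))                                  # pooled mean: 7.2
(sum(a) + sum(b)) / (length(a) + length(b))    # same: 7.2
(mean(a) + mean(b)) / 2                        # mean of means: 8.5, not the same
```

The two quantities only coincide when both quarters have the same number of observations, which is why averaging the quarterly averages would be wrong here.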

Spot check

We can use the selection feature of data.table to compute two of those rows by hand; I picked those for indices <1,2> and <4,5> here:

R> df[ ind <= 2, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
         a      b       c
1: 5.06233 10.276 14.6074
R> df[ ind == 4 | ind == 5, .(a=mean(alpha), b=mean(bravo), c=mean(charlie))]
         a      b       c
1: 4.84589 9.8468 15.2496
R> 

This pans out fine, and the approach should scale easily to millions of rows thanks to data.table.
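To widen the window to, say, the 4 quarters mentioned in the comments, the same sum-and-count trick can be sketched with `frollsum()`. Note this function only arrived in data.table 1.12.0, i.e. after this answer was written, and `k` is just an assumed window-width variable:

```r
## Sketch: k-quarter rolling average as rolling sum / rolling count
## (requires data.table >= 1.12.0 for frollsum)
k <- 4
ndf[, `:=`(ra = frollsum(sa, k) / frollsum(n, k),
           rb = frollsum(sb, k) / frollsum(n, k),
           rc = frollsum(sc, k) / frollsum(n, k))]
```

The first `k - 1` rows come out `NA`, just as the single leading `NA` row does in the two-quarter case.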

PS: All in One

As you mentioned pipes etc, you can write all of this with chained data.table operations. Not my preferred style, but possible. The following creates the exact same output without ever creating an `ndf` temporary as above:

## All in one
df[ , ind := which(datesvec==dates), by=dates][
   ,
    .(qtr=head(dates,1),          # label of quarter
      sa=sum(alpha),              # sum of a in quarter
      sb=sum(bravo),              # sum of b in quarter
      sc=sum(charlie),            # sum of c in quarter
      n=.N),                      # number of observations
    by=ind][
   ,
    `:=`(psa=shift(sa),               # previous sum of a
         psb=shift(sb),               # previous sum of b
         psc=shift(sc),               # previous sum of c
         pn=shift(n))][
    is.finite(psa),                     # where we have valid data
    `:=`(ra=(sa+psa)/(n+pn),            # total sum / total n == avg
         rb=(sb+psb)/(n+pn),
         rc=(sc+psc)/(n+pn))][
    ,c(1:2, 11:13)][]
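For completeness, since the question preferred a dplyr chain: the same sum-and-count trick can be sketched in dplyr, assuming the `df` and quarter sequence from the question (`lag()` here is dplyr's, not stats'; untested against the output above):

```r
## Sketch: pooled two-quarter means as a dplyr chain
library(dplyr)

qtr_levels <- c("Q4'15", "Q1'16", "Q2'16", "Q3'16", "Q4'16",
                "Q1'17", "Q2'17", "Q3'17", "Q4'17", "Q1'18")

df %>%
  mutate(dates = factor(dates, levels = qtr_levels)) %>%   # restore ordering
  group_by(dates) %>%
  summarise(n = n(),                                       # obs per quarter
            sa = sum(alpha), sb = sum(bravo), sc = sum(charlie)) %>%
  arrange(dates) %>%
  mutate(ra = (sa + lag(sa)) / (n + lag(n)),               # pooled mean, alpha
         rb = (sb + lag(sb)) / (n + lag(n)),               # pooled mean, bravo
         rc = (sc + lag(sc)) / (n + lag(n)))               # pooled mean, charlie
```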
Dirk Eddelbuettel
  • Thanks. BTW it may not be that the average across two quarters = the average of the two quarterly averages. I tried to express that in my original question with "Not looking for the average of the paired quarterly averages." – Michael Bellhouse Jun 09 '18 at 22:58
  • Dang, you are correct when the two have different counts. In better news, you _can_ in fact construct `sum` and `nobs` for each of the quarters, and then it is `(sum_i + sum_j) / (nobs_i + nobs_j)` for all pairs `<i,j>`. That should hold. – Dirk Eddelbuettel Jun 09 '18 at 23:02
  • Sorry if that was not overtly clear. I don't know data.table (e.g. I don't know what a `nobs` is) but aim to learn more, as I keep seeing areas where it would be more straightforward than dplyr. With all of this in mind, would you be willing to edit your answer to show the basics of creating rolling averages over all of the records? I am going out to dinner and will accept your answer later this evening. By the way, it would be great if the code were easily changeable to, say, roll over 4 quarters if needed. – Michael Bellhouse Jun 09 '18 at 23:11
  • `nobs` = number of observations – Dirk Eddelbuettel Jun 09 '18 at 23:11
  • Corrected version posted. – Dirk Eddelbuettel Jun 09 '18 at 23:22
  • Thanks for providing useful code for further study with data.table; I accept your answer with appreciation for your help. – Michael Bellhouse Jun 10 '18 at 01:51
  • My pleasure. I added an 'all in one' solution too. Not necessarily recommended, but if you must, you can... – Dirk Eddelbuettel Jun 10 '18 at 01:58