7

The solution to this question by @ShirinYavari was almost what I needed except for the use of the static averaging window width of 2. I have a dataset with random samples from multiple stations that I want to calculate a rolling 30-day geomean. I want all samples within a 30-day window of a given sample to be averaged and the width may change if preceding samples are farther or closer together in time, for instance whether you would need to average 2, 3, or more samples if 1, 2, or more preceding samples were within 30 days of a given sample.

Here is some example data, plus my code attempt:

RESULT = c(50,900,25,25,125,50,25,25,2000,25,25,
        25,25,25,25,25,25,325,25,300,475,25)
DATE = as.Date(c("2018-05-23","2018-06-05","2018-06-17",
                  "2018-08-20","2018-10-05","2016-05-22",
                  "2016-06-20","2016-07-25","2016-08-11",
                  "2017-07-21","2017-08-08","2017-09-18",
                  "2017-10-12","2011-04-19","2011-06-29",
                  "2011-08-24","2011-10-23","2012-06-28",
                  "2012-07-16","2012-08-14","2012-09-29",
                  "2012-10-24"))
FINAL_SITEID = c(rep("A", 5), rep("B", 8), rep("C", 9))
df=data.frame(FINAL_SITEID,DATE,RESULT)

data_roll <- df %>%
  group_by(FINAL_SITEID) %>%
  arrange(DATE) %>%
  mutate(day=DATE-dplyr::lag(DATE, n=1),
         day=replace_na(day, 1),
         rnk=cumsum(c(TRUE, day > 30))) %>%
  group_by(FINAL_SITEID, rnk) %>%
  mutate(count=rowid(rnk)) %>%
  mutate(GM30=rollapply(RESULT, width=count, geometric.mean, fill=RESULT, align="right"))

I get this error message, which seems like it should be an easy fix, but I can't figure it out:

Error: Column `rnk` must be length 5 (the group size) or one, not 6
  • What do you want to achieve with `rnk=cumsum(c(TRUE, day > 30))`, i.e. why is there a `TRUE` in there and why do you build the cumulative sum?? – dario Feb 08 '20 at 12:11
  • Do you need 1. 30-days rolling window mean OR 2. mean on series of consecutive observations which are not later than 30-days from previous? – GoGonzo Feb 08 '20 at 12:20
  • I think the latter, the mean should be calculated from all consecutive observations if within a 30 day window. – Laura Diemer Feb 08 '20 at 12:41
  • OK, what if I also wanted to calculate the 90th percentile - how would the geometric.mean function change in your answer? – Laura Diemer Feb 08 '20 at 12:53

2 Answers2

3

Easiest way to compute rolling statistics depending on datetime windows is runner package. You don't have to hack around to get just 30-days windows. Function runner allows you to apply any R function in rolling window. Below example of 30-days geometric.mean within FINAL_SITEID group:

library(psych)
library(runner)
df %>%
  group_by(FINAL_SITEID) %>%
  arrange(DATE) %>%
  mutate(GM30 = runner(RESULT, k = 30, idx = DATE, f = geometric.mean))

#     FINAL_SITEID DATE       RESULT  GM30
#    <fct>        <date>      <dbl> <dbl>
# 1 C            2011-04-19     25  25.0
# 2 C            2011-06-29     25  25.0
# 3 C            2011-08-24     25  25.0
# 4 C            2011-10-23     25  25.0
# 5 C            2012-06-28    325 325. 
# 6 C            2012-07-16     25  90.1
# 7 C            2012-08-14    300  86.6
# 8 C            2012-09-29    475 475. 
# 9 C            2012-10-24     25 109. 
# 10 B            2016-05-22     50  50.0
GoGonzo
  • 2,637
  • 1
  • 18
  • 25
  • 1
    Genius! Thank you! One follow-up question - when I apply this code to my larger dataset, I noticed that if the difference between two dates is exactly 30 days, it doesn't seem to be including it in the geomean calculation. Maybe it is something to do with how the dates are formatted (is it considering a time stamp?) and it makes it slightly more than 30 days? Any clarification would be helpful. – Laura Diemer Feb 08 '20 at 12:29
  • Maybe it `as.Date` "flooring" timestamp and thus exceeds window? `runner` considers everything as integer (dates, datetimes are integers). If you provide timestamp to the `idx` then you would have to specify `k = 60*60*24*30`. You can also do something like `runner::window_run(DATE, k = 30, idx = DATE)` to check which observations are actually calculated in the window. – GoGonzo Feb 08 '20 at 13:06
  • I undestand now - if difference between two days is 30-days it means that it's a 31-days window ;) – GoGonzo Feb 08 '20 at 13:15
  • Got it - thank you! And what about using a different function other than geometric mean - how would the code change if I wanted 90th percentile? – Laura Diemer Feb 08 '20 at 15:12
  • In this case you need to create function on the fly `function(x) {...}`. For 90-th percentile example will be like: `runner(x = RESULT, k = 30, idx = DATE, f = function(x) quantile(x, probs = .9))`. Go to [documentation](https://gogonzo.github.io/runner/articles/apply_any_r_function.html), and check the first example with trimmed mean – GoGonzo Feb 08 '20 at 15:26
1

The width argument of rollapply can be a vector of widths which can be set using findInterval. An example of this is shown in the Examples section of the rollapply help file and we use that below.

library(dplyr)
library(psych)
library(zoo)

data_roll <- df %>%
  arrange(FINAL_SITEID, DATE) %>%
  group_by(FINAL_SITEID) %>%
  mutate(GM30 = rollapplyr(RESULT, 1:n() - findInterval(DATE - 30, DATE), 
   geometric.mean, fill = NA)) %>%
  ungroup

giving:

# A tibble: 22 x 4
   FINAL_SITEID DATE       RESULT  GM30
   <fct>        <date>      <dbl> <dbl>
 1 A            2018-05-23     50  50.0
 2 A            2018-06-05    900 212. 
 3 A            2018-06-17     25 104. 
 4 A            2018-08-20     25  25.0
 5 A            2018-10-05    125 125. 
 6 B            2016-05-22     50  50.0
 7 B            2016-06-20     25  35.4
 8 B            2016-07-25     25  25.0
 9 B            2016-08-11   2000 224. 
10 B            2017-07-21     25  25.0
# ... with 12 more rows
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341