-1

How do I apply rollapplyr on the following data to allow it be sensitive to the date field? Because currently I am able to apply the rolling (blind to the date) over the dataset with eg. 4-quarters period and minimum of 2 observations in the 4 quarters.

#creating the data
   set.seed(123)
    data.frame(id=c(1,1,1,1,1,2,2,2,2,2), 
               date=as.Date(as.character(c(20040930, 20041231, 20050331, 20050630, 20050930, 20040930, 20050331, 20050630, 20051231, 20060331)), format = "%Y%m%d"),
               col_a=round(runif(10, 0, 100),0),
               col_b=round(runif(10, 0, 100),0))

   id       date col_a col_b
1   1 2004-09-30     3    10
2   1 2004-12-31     8     5
3   1 2005-03-31     4     7
4   1 2005-06-30     9     6
5   1 2005-09-30     9     1
6   2 2004-09-30     0     9
      <missing>
7   2 2005-03-31     5     2
8   2 2005-06-30     9     0
      <missing>
9   2 2005-12-31     6     3
10  2 2006-03-31     5    10

This is what I have attempted so far, but this will not take into consideration of the missing records, eg. id=2's 2005-09-30 record.

library(zoo)
data %>%
  group_by(id) %>% 
  mutate(score = (col_a + col_b) / rollapplyr(col_b, 4, mean, fill=NA, by.column=TRUE, partial=2)) %>% 
  ungroup %>% select(id, date, col_a, col_b, score)

And this is what I got after applying the above function

      id date       col_a col_b score
   <dbl> <date>     <dbl> <dbl> <dbl>
 1     1 2004-09-30     3    10 NA   
 2     1 2004-12-31     8     5  1.73
 3     1 2005-03-31     4     7  1.5 
 4     1 2005-06-30     9     6  2.14
 5     1 2005-09-30     9     1  2.11
 6     2 2004-09-30     0     9 NA   
 7     2 2005-03-31     5     2  1.27
 8     2 2005-06-30     9     0  2.45
 9     2 2005-12-31     6     3  2.57
10     2 2006-03-31     5    10  4   

However what I am expecting is it will take into consideration the missing quarters itself automatically. This is my expected output

       id date       col_a col_b score
   <dbl> <date>     <dbl> <dbl> <dbl>
 1     1 2004-09-30     3    10 NA   
 2     1 2004-12-31     8     5  1.73
 3     1 2005-03-31     4     7  1.5 
 4     1 2005-06-30     9     6  2.14
 5     1 2005-09-30     9     1  2.11
 6     2 2004-09-30     0     9 NA   
                <missing>
 7     2 2005-03-31     5     2  1.27
 8     2 2005-06-30     9     0  2.45
                <missing>
 9     2 2005-12-31     6     3  **5.4**
10     2 2006-03-31     5    10  **3.46**  

Note that the "<missing>" will not be shown in the output, I just put for visual purpose. So eg. row 10 will only use row 8,9 and 10's records because the missing row is counted as a row too. How do I achieve that?

Note that eg. for row 10, n=3 should be used for the averaging not n=4 as it shouldn't include the missing rows.

halfer
  • 19,824
  • 17
  • 99
  • 186
yeeen
  • 4,911
  • 11
  • 52
  • 73
  • are you using the same `set.seed` as I can't reproduce your data – akrun Oct 06 '18 at 18:00
  • 1
    @akrun correct ya i ran set.seed(123) can reproduce the data – yeeen Oct 06 '18 at 18:19
  • I have a doubt, why is the row 8 not using n = 3. After the row 6, there is a `` – akrun Oct 06 '18 at 19:42
  • @akrun I put "**" cos it is a change from the previous output. It is also using n=3, but it has no records beyond that so it takes only 3 rounds. Whereas round 10 and 9 average over 4 records, which should be changed to 3 cos there is one missing record – yeeen Oct 07 '18 at 02:49

1 Answers1

2

One option would be to create the complete rows of 'date' for all 'id's before the group_by

library(tidyverse)
library(zoo)
complete(data, id, date, fill = list(col_a = 0, col_b = 0)) %>% 
      group_by(id) %>% 
      mutate(score = (col_a + col_b) / 
         rollapplyr(col_b, 4, sum, fill=NA, by.column=TRUE, partial=2)) %>% 
      ungroup %>% 
      select(id, date, col_a, col_b, score) %>%
      right_join(data)
# A tibble: 10 x 5
#      id date       col_a col_b  score
#   <dbl> <date>     <dbl> <dbl>  <dbl>
# 1     1 2004-09-30     3    10 NA    
# 2     1 2004-12-31     8     5  0.867
# 3     1 2005-03-31     4     7  0.5  
# 4     1 2005-06-30     9     6  0.536
# 5     1 2005-09-30     9     1  0.526
# 6     2 2004-09-30     0     9 NA    
# 7     2 2005-03-31     5     2  0.636
# 8     2 2005-06-30     9     0  0.818
# 9     2 2005-12-31     6     3  1.8  
#10     2 2006-03-31     5    10  1.15 

data

data <- structure(list(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  date = structure(c(12691, 
 12783, 12873, 12964, 13056, 12691, 12873, 12964, 13148, 13238
 ), class = "Date"), col_a = c(3, 8, 4, 9, 9, 0, 5, 9, 6, 5), 
col_b = c(10, 5, 7, 6, 1, 9, 2, 0, 3, 10)), row.names = c(NA, 
 -10L), class = "data.frame")
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks, but i realised sth... it sum up including the missing rows... i.e. if i want to do mean or sd it doesn't work... Cos for row 10, it shld be n=3 not n=4 – yeeen Oct 06 '18 at 18:25
  • @yeeen It is based on the example and fucntion you used – akrun Oct 06 '18 at 18:26
  • i understand... actually i realised that and wanted to amend the qn but u alr replied before i can amend it... So sorry. Do u hv a soln for the revised qn? – yeeen Oct 06 '18 at 18:28
  • @yeeen You may need to update with your expected output for mean – akrun Oct 06 '18 at 18:29