0

I have a database with sales value for individual firms that belong to different industries. In the example dataset below:

set.seed(123)
df <- data.table(year=rep(1980:1984,each=4),sale=sample(100:150,20),ind=sample(LETTERS[1:2],20,replace = TRUE))
df[order(year,ind)]
    year sale ind
 1: 1980  114   A
 2: 1980  102   A
 3: 1980  130   B
 4: 1980  113   B
 5: 1981  136   A
 6: 1981  148   A
 7: 1981  141   B
 8: 1981  142   B
 9: 1982  124   A
10: 1982  125   A
11: 1982  104   A
12: 1982  126   B
13: 1983  108   A
14: 1983  128   A
15: 1983  140   B
16: 1983  127   B
17: 1984  134   A
18: 1984  107   A
19: 1984  106   A
20: 1984  146   B

The column "ind" represents industry and I have omitted the firm identifiers (no use in this example). I want an average defined as follows:

For each year, the desired average is the average of all firms within the industry over the past three years. If the data for past three years is not available, a minimum of two observations is also acceptable.

For example, in the above dataset, if year=1982, and ind=A, there are only two observations for past years (which is still acceptable), so the desired average is the average of all sale values in years 1980 and 1981 for industry A.

If year=1983, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1980, 1981, and 1982 for industry A.

If year=1984, and ind=A, we have three prior years, and the desired average is the average of all sale values in years 1981, 1982, and 1983 for industry A.

The desired output, thus, will be as follows:

    year sale ind   mymean
 1: 1980  130   B       NA
 2: 1980  114   A       NA
 3: 1980  113   B       NA
 4: 1980  102   A       NA
 5: 1981  141   B       NA
 6: 1981  142   B       NA
 7: 1981  136   A       NA
 8: 1981  148   A       NA
 9: 1982  124   A 125.0000
10: 1982  125   A 125.0000
11: 1982  126   B 131.5000
12: 1982  104   A 125.0000
13: 1983  140   B 130.4000
14: 1983  127   B 130.4000
15: 1983  108   A 121.8571
16: 1983  128   A 121.8571
17: 1984  134   A 124.7143
18: 1984  107   A 124.7143
19: 1984  146   B 135.2000
20: 1984  106   A 124.7143

A data.table solution is much preferred for fast implementation. Many thanks in advance.

lovestacksflow
  • 521
  • 3
  • 14

3 Answers3

2

I am not very good in data.table. Here is one tidyverse solution if you like or if you can translate it to data.table

library(tidyverse)

df %>% group_by(ind, year) %>% 
  summarise(ds = sum(sale),
            dn = n()) %>%
  mutate(ds = (lag(ds,1)+lag(ds,2)+ifelse(is.na(lag(ds,3)), 0, lag(ds,3)))/(lag(dn,1)+lag(dn,2)+ifelse(is.na(lag(dn,3)), 0, lag(dn,3)))
  ) %>% select(ind, year, mymean = ds) %>%
  right_join(df, by = c("ind", "year"))

`summarise()` regrouping output by 'ind' (override with `.groups` argument)
# A tibble: 20 x 4
   ind    year mymean  sale
   <chr> <int>  <dbl> <int>
 1 A      1980    NA    114
 2 A      1980    NA    102
 3 A      1981    NA    136
 4 A      1981    NA    148
 5 A      1982   125    124
 6 A      1982   125    125
 7 A      1982   125    104
 8 A      1983   122.   108
 9 A      1983   122.   128
10 A      1984   125.   134
11 A      1984   125.   107
12 A      1984   125.   106
13 B      1980    NA    130
14 B      1980    NA    113
15 B      1981    NA    141
16 B      1981    NA    142
17 B      1982   132.   126
18 B      1983   130.   140
19 B      1983   130.   127
20 B      1984   135.   146
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • So which strategy you actually wanted, `mean of all` OR `mean of means`? – AnilGoyal Dec 26 '20 at 13:46
  • 1
    I actually wanted the values of `mean of all previous three year values`, which is what you went for – lovestacksflow Dec 26 '20 at 14:45
  • I actually upvoted this question for this very reason. Someone lateron however nullified my upvote. But still I want to learn from R experts that whether such calculation can be made through rolling computation. Let's see if someone proposes another answer – AnilGoyal Dec 26 '20 at 15:04
  • I've posted an answer (I'm no expert by any means though!), check it out and see if it helps. – lovestacksflow Dec 26 '20 at 15:16
2

You can use zoo's rollapply function to perform this rolling calculation. Note that there are dedicated functions to calculate rolling mean like frollmean in data.table and rollmean in zoo but they lack the argument partial = TRUE present in rollapply. partial = TRUE is useful here since you want to calculate the mean even if the window size is less than 3.

We can first calculate mean of sale value for each ind and year, then perform rolling mean calculation with window size of 3 and join this data with the original dataframe to get all the rows of original dataframe back.

library(data.table)
library(zoo)

df1 <- df[, .(sale = mean(sale)), .(ind, year)]
df2 <- df1[, my_mean := shift(rollapplyr(sale, 3, function(x) 
  if(length(x) > 1) mean(x, na.rm = TRUE) else NA, partial = TRUE)), ind]
df[df2, on = .(ind, year)]

This can be written using dplyr as :

library(dplyr)

df %>%
  group_by(ind, year) %>%
  summarise(sale = mean(sale)) %>%
  mutate(avg_mean = lag(rollapplyr(sale, 3, partial = TRUE, function(x) 
                    if(length(x) > 1) mean(x, na.rm = TRUE) else NA))) %>% 
  left_join(df, by = c('ind', 'year'))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Ronak, I had a doubt here. Since it is not mandatory that each year has equal number of rows (years and sale), pre-calculating mean and calculating it again for some previous groups will not give desired values. You code gives `125.889` as mean for `A` & `1984` as against desired `124.7143`. 125.889 is actually the mean of three previous years' means which is not required. Required value is mean of all values of previous three years. – AnilGoyal Dec 26 '20 at 13:08
  • 1
    @AnilGoyal Hmmm...I had some doubt about how OP is calculating mean as well. As far as I could understand value in 1984 is mean for 1981, 1982 and 1983. It doesn't include 1984 value that is why I have used `shift`/`lag` here. – Ronak Shah Dec 26 '20 at 13:23
  • 1
    To use `rollapply` with prior 3 values use `rollapply(sale, list(-(3:1)), mean, partial = TRUE, fill = NA)` eliminating need for `shift`. The second argument means use offsets -3, -2 and -1. – G. Grothendieck Dec 26 '20 at 13:33
  • Thanks @G.Grothendieck. I am always confused about how vector window size works in `zoo`. – Ronak Shah Dec 26 '20 at 13:46
  • @lovestacksflow, I followed long strategy because I thought you need `mean of all previous three year values` instead of `mean of three previous years' means`. I am still not sure whether former can be calculated through rollapply? – AnilGoyal Dec 26 '20 at 13:49
  • 1
    @ronak-shah I double checked and 124.7143 should be the right answer. The values in industry A from 1981 to 1983 are (136, 148, 124, 125, 104, 108, 128), and the average is 124.7143. – lovestacksflow Dec 26 '20 at 14:17
  • 1
    @AnilGoyal I double checked and you're right, the values of `mean of all previous three year values` are wanted – lovestacksflow Dec 26 '20 at 14:21
  • If x is a vector n long then in rollapply(x, w, f) w can be (1) a numeric n-vector such that w[i] is width to use at position i in conjuction with `align`. If w is less than n long it is recycled to be n long. (2) Alternately w can be an n long list such that w[[i]] is the offsets to apply at position i. If w is less than n long it is recycled. Thus if w is a list with one element (that element would be a numeric vector of offsets) then that one component would be recycled to n identical components & applied to every position. This seems consistent with how R works. – G. Grothendieck Dec 26 '20 at 14:31
1

Based on Ronak's answer (the mean of previous means), a more general way (the mean of all previous values), and a data.table solution then can be:

library(data.table)
library(roll)

df1 <- df[, .(sum_1 = sum(sale), n=length(sale)), .(ind, year)]
df1[,`:=`(
  my_sum = roll_sum(shift(sum_1),3,min_obs = 2),
  my_n = roll_sum(shift(n),3,min_obs = 2)
  ),by=.(ind)]
df1[,`:=`(my_mean=(my_sum/my_n))]
> df[df1[,!c("sum_1","n","my_sum","my_n")] ,on = .(ind, year)]
    year sale ind  my_mean
 1: 1980  130   B       NA
 2: 1980  113   B       NA
 3: 1980  114   A       NA
 4: 1980  102   A       NA
 5: 1981  141   B       NA
 6: 1981  142   B       NA
 7: 1981  136   A       NA
 8: 1981  148   A       NA
 9: 1982  124   A 125.0000
10: 1982  125   A 125.0000
11: 1982  104   A 125.0000
12: 1982  126   B 131.5000
13: 1983  140   B 130.4000
14: 1983  127   B 130.4000
15: 1983  108   A 121.8571
16: 1983  128   A 121.8571
17: 1984  134   A 124.7143
18: 1984  107   A 124.7143
19: 1984  106   A 124.7143
20: 1984  146   B 135.2000
lovestacksflow
  • 521
  • 3
  • 14