Panel data: Calculate group means while omitting first period from calculation

Question

I have an issue regarding a certain kind of mean() calculation. I use a panel data set with two indentifiers "ID" and "year" (using the plm pkg)

I want to calculate the groupwise mean of a variable "y", but omit the first year's entry of the calculation and then only fill in the calculated mean only in the years that were used to calculate it. In other words, I want to have NA in every ID's first entry of this variable.

The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning till end, for others I just have data for three 3 years.

library(tidyverse)
library(plm)

ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))

I first tried a filter over periods like so:

dt <- dt %>% group_by(ID) %>% 
  filter(year %in% first(year)+1:last(year)) %>% 
  mutate(mean.y = mean(y))

But that doesn't work, and I am not surprised to be honest but I hope you know what I want to achieve. The final result should look like this:

See how the first entry of variable y = 9 for "a-2001" is left out so that it doesnt affect the mean of individual a's other y entries (2+5+3+3)/4

i hope you people could understand it. I would massively appreciate any help. Bye

ekoam · Answer 1 · 2022-01-24T01:18:58.710

2

Here is a dplyr solution. You can calculate the mean of all values except for the first one and then use is.na<- function to assign the first element of mean.y as NA.

library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))

Output

# A tibble: 12 x 4
# Groups:   ID [3]
   ID        y  year mean.y
   <chr> <dbl> <dbl>  <dbl>
 1 a         9  2001  NA   
 2 a         2  2002   3.25
 3 a         5  2003   3.25
 4 a         3  2004   3.25
 5 a         3  2005   3.25
 6 b         9  2001  NA   
 7 b         1  2002   2   
 8 b         2  2003   2   
 9 b         3  2004   2   
10 c         9  2002  NA   
11 c         2  2003   3.5 
12 c         5  2004   3.5

More compactly,

dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])

edited Jan 24 '22 at 01:18

answered Jan 23 '22 at 09:37

ekoam

8,744
1
9
22

oh wow this is amazing thank you. By the way the more compact form did not work for me but only gave me NA's but the first one does the trick. I really appreciate it – tony13s Jan 23 '22 at 09:52
1

There is a small mistake in the code. Now it should work. @tony13s – ekoam Jan 23 '22 at 09:54
BTW @tony13s. If this or any other answer has solved your question, please consider [accepting it](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) by clicking the check-mark. This indicates to the wider community that you've found a solution and gives some reputation to both the answerer and yourself. There is no obligation to do this. – ekoam Jan 23 '22 at 10:07
Thanks @ekoam, please can you explain ``is.na<-`(mean.y, 1L)`` – Alvaro Morales Jan 23 '22 at 15:56
@AlvaroMorales It is a function and returns a vector basically the same as `mean.y` except that the first element (`1L`) is replaced with an `NA` value. For instance, `\`is.na<-\`(c(1,2,3), 1L)` gives `NA 2 3` – ekoam Jan 23 '22 at 16:30
are there other functions that have the same behavior? – Alvaro Morales Jan 23 '22 at 17:03
@AlvaroMorales Every function exists for a unique reason and if two function has the same effect, why do you need them both? – ekoam Jan 24 '22 at 01:04

TimTeaFan · Answer 2 · 2022-01-23T11:23:38.077

We could work with an ifelse inside mutate. Its more code, but I think its quite readable and easy to understand whats going on.

library(tidyverse)
library(plm)

dt %>% 
  group_by(ID) %>% 
  mutate(mean.y = ifelse(year == first(year),
                         NA,
                         mean(y[year != first(year)], na.rm = TRUE)))

#> # A tibble: 12 x 4
#> # Groups:   ID [3]
#>    ID        y year  mean.y
#>    <fct> <dbl> <fct>  <dbl>
#>  1 a         9 2001   NA   
#>  2 a         2 2002    3.25
#>  3 a         5 2003    3.25
#>  4 a         3 2004    3.25
#>  5 a         3 2005    3.25
#>  6 b         9 2001   NA   
#>  7 b         1 2002    2   
#>  8 b         2 2003    2   
#>  9 b         3 2004    2   
#> 10 c         9 2002   NA   
#> 11 c         2 2003    3.5 
#> 12 c         5 2004    3.5

^{Created on 2022-01-23 by the reprex package (v0.3.0)}

yes you are right that is pretty well understandable. I guess this approach will help me also in the work to come thank you very much — tony13s, Jan 27 '22 at 12:49

Panel data: Calculate group means while omitting first period from calculation

2 Answers2