Aggregate data by year and override NA in row ONLY when value in column for that year

Question

Suppose I have the data frame:

 dfTest <- data.frame(year = c(1,2,3,1,3), 
               meanVal = c(1,2,3,1,3),
               var1 = c(1,2,3,1,3),
               var2 = c(NA,2,NA,1,3),
               var3 = c(1,NA,NA,1,3))


 > dfTest
    year meanVal var1 var2 var3
 1    1       1    1   NA    1
 2    2       2    2    2   NA
 3    3       3    3   NA   NA
 4    1       1    1    1    1
 5    3       3    3    3    3

What I need:

> dfTest
   year meanVal var1 var2 var3
1    1       1    1    1    1
2    2       2    2    2   NA
3    3       3    3    3    3

For this to happen, rows need to collapse by year. If there is an NA for a column, it should replace with the value in that column. i.e. there is only 1 value for var1 for year 1 (1). However some rows in the var1 column may have an NA. Further, if there is only an NA for a column like we see with year 2 var 3, the NA must remain.

score 2 · Answer 1 · answered Dec 17 '18 at 22:33

In base R this can be done with split/lapply.

res <- lapply(split(dfTest, dfTest$year), function(DF){
  c(year = unique(DF[["year"]]),
    meanVal = unique(DF[["meanVal"]]),
    colMeans(DF[3:5], na.rm = TRUE)
    )
})
res <- do.call(rbind, res)
is.na(res) <- is.nan(res)

res
#  year meanVal var1 var2 var3
#1    1       1    1    1    1
#2    2       2    2    2   NA
#3    3       3    3    3    3

score 2 · Accepted Answer · answered Dec 18 '18 at 02:35

We could also do this with aggregate from base R. Create a function to take care of cases where there is a single row for each group (or multiple rows with all NA) and that is a missing value. The default method for mean returns NaN when applied with na.rm = TRUE

aggregate(.~ year, dfTest, mean, na.rm = TRUE, na.action = NULL)

To avoid getting NaN, create the function (f1) that checks whether all the elements are NA for a particular column per group. With an if/else we can return NA for thoser cases.

f1 <- function(x) if(all(is.na(x))) NA else mean(x, na.rm = TRUE)
aggregate(.~ year, dfTest, f1, na.action = NULL)
#  year meanVal var1 var2 var3
#1    1       1    1    1    1
#2    2       2    2    2   NA
#3    3       3    3    3    3

Aggregate data by year and override NA in row ONLY when value in column for that year

2 Answers2