Replace NA with mean of variable grouped by time and treatment

Question

I have a data frame, similar to the one below (see dput), recording responses of a variable to a treatment over time:

df <- structure(list( time = c(0, 0, 0, 0, 0, 0, 14, 14, 14, 14, 14, 14, 33, 33, 33, 33, 33, 33, 90, 90, 90, 90, 90, 90),
                      trt = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
                      .Label = c("1", "2"), class = "factor"), 
               A1 = c(6.301, 5.426, 5.6021, NA, NA, NA, 6.1663, 6.426, 6.8239, 2.301, 4.7047, 2.301, 5.8062, 4.97, 4.97, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301),
               B1 = c(5.727, 5.727, 5.4472, NA, NA, NA, 6.6021, 7.028, 7.1249, 3.028, 3.1663, 3.6021, 5.727, 5.2711, 5.2389, 3.3554, 3.9031, 4.2389, 3.727, 3.6021, 3.6021, 3.8239, 3.727, 3.426)),
               row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))

which looks like this:

    time trt      A1    B1
   <dbl> <fct> <dbl> <dbl>
 1     0 2      6.30  5.73
 2     0 2      5.43  5.73
 3     0 2      5.60  5.45
 4     0 1     NA    NA   
 5     0 1     NA    NA   
 6     0 1     NA    NA   
 7    14 2      6.17  6.60
 8    14 2      6.43  7.03
 9    14 2      6.82  7.12
10    14 1      2.30  3.03

In our experiments, we don’t always record values for all treatments at time == 0. I want to replace any missing values (NA) when (and only when) time == 0 with the mean of the trt ‘2’ group at time == 0. So NA in A1 all become 5.78, and those in B1 become 5.63.

Using answers from here and here, as well as some others, I have been able to come up with the following:

df %>% 
  mutate_if(is.numeric, funs(if_else(is.na(.),if_else(time == 0, 0, .), .)))

This replaces NA at time == 0 with 0 (useful for some of my variables where no treatment has data at time == 0, but not what I'm after here). I also tried this:

df %>% 
  mutate_if(is.numeric, funs(if_else(is.na(.),if_else(time == 0, mean(., na.rm = TRUE), .), .)))

This is closer to what I want, but is averaging the values from the whole column/variable. Can I make it average only those values from treatment ‘2’ when time == 0?

score 2 · Accepted Answer · answered May 15 '20 at 14:56

I think I would just use indexing in base R for this:

within(df, {A1[is.na(A1) & time == 0] <- mean(A1[trt == "2" & time == 0])
            B1[is.na(B1) & time == 0] <- mean(B1[trt == "2" & time == 0])})
#> # A tibble: 24 x 4
#>     time trt      A1    B1
#>    <dbl> <fct> <dbl> <dbl>
#>  1     0 2      6.30  5.73
#>  2     0 2      5.43  5.73
#>  3     0 2      5.60  5.45
#>  4     0 1      5.78  5.63
#>  5     0 1      5.78  5.63
#>  6     0 1      5.78  5.63
#>  7    14 2      6.17  6.60
#>  8    14 2      6.43  7.03
#>  9    14 2      6.82  7.12
#> 10    14 1      2.30  3.03
#> # ... with 14 more rows

Created on 2020-05-15 by the reprex package (v0.3.0)

Allan Cameron


Thanks, this works exactly as i wanted. As i have more than 2 variables to apply this to, is it possible to do this without explicitly stating each variable? e.g. could i try this in a loop, and loop through A1, B1, C1 etc? – Mark May 16 '20 at 19:03
something like this: `for (i in c("A1", "B1")){ within(df, {i[is.na(i) & time == 0] <- mean(i[trt == "2" & time == 0])})}` but its giving error: `In mean.default(i[trt == "2" & time == 0]) : argument is not numeric or logical: returning NA` – Mark May 16 '20 at 19:17
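
The loop in the comment above fails because `i` holds the column name as a string ("A1"), not the column itself, so mean() receives a character value; the result of within() is also never assigned back to df. A minimal base R sketch of one way to loop over columns is shown below. The cut-down data frame and the names `cols`, `ref`, and `fill` are illustrative assumptions, not part of the original answer.

```r
# Small stand-in for the question's data (values copied from its time 0 and 14 rows).
df <- data.frame(
  time = c(0, 0, 0, 0, 0, 0, 14, 14),
  trt  = factor(c("2", "2", "2", "1", "1", "1", "2", "1")),
  A1   = c(6.301, 5.426, 5.6021, NA, NA, NA, 6.1663, 2.301),
  B1   = c(5.727, 5.727, 5.4472, NA, NA, NA, 6.6021, 3.028)
)

cols <- c("A1", "B1")                 # columns to fill; extend with C1 etc. as needed
ref  <- df$trt == "2" & df$time == 0  # rows that supply the replacement mean

for (col in cols) {
  # Only NAs recorded at time == 0 are replaced; NAs at other times are left alone.
  fill <- is.na(df[[col]]) & df$time == 0
  df[[col]][fill] <- mean(df[[col]][ref], na.rm = TRUE)
}

df
```

Indexing with `df[[col]]` looks the column up by its name, which is what the string-valued loop variable needs.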

Len Greski · Answer 2 · 2020-05-16T23:16:43.350

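If we add group_by(time), we can recode the missing values in each column to the time-specific mean for the observations where time == 0, as follows. Note that this averages every non-missing value in the time group; it matches the trt "2" mean here only because trt "1" has no recorded values at time == 0.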

df <- structure(list( time = c(0, 0, 0, 0, 0, 0, 14, 14, 14, 14, 14, 14, 33, 33, 33, 33, 33, 33, 90, 90, 90, 90, 90, 90),
                      trt = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
                                      .Label = c("1", "2"), class = "factor"), 
                      A1 = c(6.301, 5.426, 5.6021, NA, NA, NA, 6.1663, 6.426, 6.8239, 2.301, 4.7047, 2.301, 5.8062, 4.97, 4.97, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301, 2.301),
                      B1 = c(5.727, 5.727, 5.4472, NA, NA, NA, 6.6021, 7.028, 7.1249, 3.028, 3.1663, 3.6021, 5.727, 5.2711, 5.2389, 3.3554, 3.9031, 4.2389, 3.727, 3.6021, 3.6021, 3.8239, 3.727, 3.426)),
                row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"))

library(dplyr)
df %>% group_by(time)  %>%
     mutate(A1 = if_else(is.na(A1) & time == 0,mean(A1,na.rm=TRUE),A1),
            B1 = if_else(is.na(B1) & time == 0,mean(B1,na.rm=TRUE),B1))

...and the output:

# A tibble: 24 x 4
# Groups:   time [4]
    time trt      A1    B1
   <dbl> <fct> <dbl> <dbl>
 1     0 2      6.30  5.73
 2     0 2      5.43  5.73
 3     0 2      5.60  5.45
 4     0 1      5.78  5.63
 5     0 1      5.78  5.63
 6     0 1      5.78  5.63
 7    14 2      6.17  6.60
 8    14 2      6.43  7.03
 9    14 2      6.82  7.12
10    14 1      2.30  3.03
# ... with 14 more rows

UPDATE: general solution across multiple columns

Per the comments in my answer, here is a solution that uses the development version of dplyr to access the new across() function.

devtools::install_github("tidyverse/dplyr") # needed for across()
# get all columns except time and trt
theColumns <- colnames(df)[!(colnames(df) %in% c("time","trt"))]
df %>% group_by(time)  %>%
     mutate(across(all_of(theColumns), ~if_else(is.na(.) & time == 0, mean(., na.rm=TRUE), .)))

...and the output:

# Groups:   time [4]
    time trt      A1    B1
   <dbl> <fct> <dbl> <dbl>
 1     0 2      6.30  5.73
 2     0 2      5.43  5.73
 3     0 2      5.60  5.45
 4     0 1      5.78  5.63
 5     0 1      5.78  5.63
 6     0 1      5.78  5.63
 7    14 2      6.17  6.60
 8    14 2      6.43  7.03
 9    14 2      6.82  7.12
10    14 1      2.30  3.03
# … with 14 more rows
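
The grouped mean above uses every non-missing value in the time group. If trt "1" had even one recorded value at time == 0, the fill would no longer equal the trt "2" mean that the question asks for. The sketch below restricts the mean to trt "2" inside each group; it assumes dplyr 1.0.0 or later, and the data in `df2` are made up purely to show the difference.

```r
library(dplyr)

# Hypothetical data: trt "1" has one recorded value (2) at time 0.
df2 <- tibble(
  time = c(0, 0, 0, 0, 0, 14, 14),
  trt  = factor(c("2", "2", "2", "1", "1", "2", "1")),
  A1   = c(6, 5, 7, 2, NA, 6.5, NA)
)

out <- df2 %>%
  group_by(time) %>%
  mutate(across(
    all_of("A1"),
    ~ if_else(is.na(.x) & time == 0,
              mean(.x[trt == "2"], na.rm = TRUE),  # trt "2" rows of the current time group
              .x)
  )) %>%
  ungroup()

out
```

With these values the NA at time 0 is filled with 6 (the trt "2" mean) rather than 5 (the mean of all four recorded time-0 values), and the NA at time 14 is left untouched.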

thanks for this - i would prefer a tidyverse / dplyr solution such as yours, but this one also replaces NA in groups other than time == 0, which i do not want... — Mark, May 16 '20 at 19:06
@Mark - Given the test data, it was unclear that you wanted to leave other time values as NA. This is easily fixed (see minor adjustment in my answer). — Len Greski, May 16 '20 at 19:52
thanks for the update. Same question to you as to Alen (apologies for lack of clarity in the question), can this be applied to a group of variables without explicitly naming them as you have, that is - do the same thing to A1, B1, C1 etc? — Mark, May 16 '20 at 21:23
@Mark - yes, will post an update later this evening US hours. — Len Greski, May 16 '20 at 21:37
@Mark -- see updated answer, using `dplyr` 1.0.0 and its new `across()` function. — Len Greski, May 16 '20 at 23:17
thanks for the update - installing the developer version seems to have been beyond me (broke a few things trying!) but i will definitely check out the solution using across() once dplyr 1.0.0 is ready — Mark, May 18 '20 at 14:07

Mark · Answer 3 · 2020-10-30T15:00:01.443

As I was unable to access the development version of dplyr to use the new across() function, I combined elements of both answers above to give the result I wanted:

df %>%
  mutate_if(is.numeric, funs(if_else(is.na(.) & time == 0, mean(.[trt == "2" & time == 0]), .)))

It looks like across() is intended to replace the _if functions in the long run (see here), but this solution works in the meantime.
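
For readers on dplyr 1.0.0 or later, where funs() is deprecated, the same idea can be written with across(). This is a sketch under that version assumption, not the author's code; the cut-down data frame is illustrative, and `!time` is added so the numeric time column itself is not passed through the function.

```r
library(dplyr)

# Small stand-in for the question's data.
df3 <- tibble(
  time = c(0, 0, 0, 0, 0, 0, 14, 14),
  trt  = factor(c("2", "2", "2", "1", "1", "1", "2", "1")),
  A1   = c(6.301, 5.426, 5.6021, NA, NA, NA, 6.1663, 2.301),
  B1   = c(5.727, 5.727, 5.4472, NA, NA, NA, 6.6021, 3.028)
)

out3 <- df3 %>%
  mutate(across(
    where(is.numeric) & !time,  # every numeric column except time
    ~ if_else(is.na(.x) & time == 0,
              mean(.x[trt == "2" & time == 0], na.rm = TRUE),
              .x)
  ))

out3
```

No group_by() is needed here because the lambda selects the reference rows (trt "2" at time 0) explicitly.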

edited Oct 30 '20 at 15:00

answered May 22 '20 at 08:29

Mark

