I came across something weird with dplyr
and across
, or at least something I do not understand.
If we use the across function to compute the mean
and standard error of the mean
across multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across
commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()
? In my situation I'd like the mean
and the standard error of the mean
calculated across every variable in my dataset.