summarize across -- is it order dependent?

Question

I came across something weird with dplyr and across, or at least something I do not understand.

If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:

mtcars %>% group_by(gear) %>% select(mpg, cyl) %>%
  summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
            across(everything(), ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))), .names = "se_{col}")) %>% head()

Which results in

   gear   mpg   cyl se_mpg se_cyl
  <dbl> <dbl> <dbl>  <dbl>  <dbl>
1     3  16.1  7.47     NA     NA
2     4  24.5  4.67     NA     NA
3     5  21.4  6        NA     NA

However, if I switch the order of the individual across commands, I get the following:

mtcars %>% group_by(gear) %>% select(mpg, cyl) %>%
  summarize(across(everything(), ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))), .names = "se_{col}"),
            across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()

# A tibble: 3 x 5
   gear se_mpg se_cyl   mpg   cyl
  <dbl>  <dbl>  <dbl> <dbl> <dbl>
1     3  0.871  0.307  16.1  7.47
2     4  1.52   0.284  24.5  4.67
3     5  2.98   0.894  21.4  6

Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.

Can you please make a complete example? Why does your code start with `summarize`? Where is the dataframe? — Ronak Shah, Aug 25 '20 at 15:02

score 2 · Accepted Answer · answered Aug 25 '20 at 15:58

I have no idea why summarize behaves like that; it's probably due to an underlying interaction between the two across calls (although it seems weird to me). Anyway, I suggest writing a single across statement with a list of lambda functions, as suggested in the across documentation.

This way, it doesn't matter whether the mean or the standard error is specified first: you get no NAs either way.

mtcars %>% 
  group_by(gear) %>% 
  select(mpg, cyl) %>% 
  summarize(across(everything(), list(
    mean = ~mean(.x, na.rm = TRUE),
    se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
  ), .names = "{fn}_{col}"))

# A tibble: 3 x 5
#    gear mean_mpg se_mpg mean_cyl se_cyl
#   <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
# 1     3     16.1  0.871     7.47  0.307
# 2     4     24.5  1.52      4.67  0.284
# 3     5     21.4  2.98      6     0.894



mtcars %>% 
  group_by(gear) %>% 
  select(mpg, cyl) %>% 
  summarize(across(everything(), list(
    se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
    mean = ~mean(.x, na.rm = TRUE)
  ), .names = "{fn}_{col}"))

# A tibble: 3 x 5
#    gear se_mpg mean_mpg se_cyl mean_cyl
#   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
# 1     3  0.871     16.1  0.307     7.47
# 2     4  1.52      24.5  0.284     4.67
# 3     5  2.98      21.4  0.894     6
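
A possible explanation, offered as a sketch rather than a confirmed diagnosis: summarize() evaluates its expressions in order, and the dplyr column-wise vignette warns that a later across() also picks up columns created earlier in the same call. Under that reading, in the mean-first version the second across(everything()) works on the already-summarized mpg and cyl, which hold one value per group, and sd() of a single value is NA. The example below checks that base R fact, then keeps two across calls by giving the means distinct names and selecting mpg and cyl explicitly in the second call. The mean_ prefix and the res variable are my own choices, not part of the question.

```r
library(dplyr)

# sd() of a single value is NA, so a standard error built from it is NA too
sd(16.1)
#> [1] NA

# Assumption: a later across() can see columns created earlier in the same
# summarize() call. Naming the means mean_{col} leaves the original mpg and
# cyl untouched, and c(mpg, cyl) keeps the new mean_* columns out of the
# second across().
res <- mtcars %>%
  group_by(gear) %>%
  select(mpg, cyl) %>%
  summarize(
    across(everything(), ~mean(.x, na.rm = TRUE), .names = "mean_{col}"),
    across(c(mpg, cyl), ~sd(.x, na.rm = TRUE) / sqrt(sum(!is.na(.x))),
           .names = "se_{col}")
  )

res
```

If that reading is right, the single across with a list of functions works because every function is applied to the original columns before any result is written back.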

Thanks. It's odd that it's happening, since the operations should be happening simultaneously. — vashts85, Aug 25 '20 at 17:29
I agree with you. I didn't manage to understand the issue, only to find a workaround — Ric S, Aug 25 '20 at 20:42
