n() acting inconsistently when used in summarise_at()

Question

Using this example data:

library(tidyverse)

set.seed(123)
df <- data_frame(X1 = rep(LETTERS[1:4], 6),
                 X2 = sort(rep(1:6, 4)),
                 ref = sample(1:50, 24),
                 sampl1 = sample(1:50, 24),
                 var2 = sample(1:50, 24),
                 meas3 = sample(1:50, 24))

I can use summarise_at() to count the number of values in a subset of columns:

df %>% summarise_at(vars(contains("2")), funs(sd_expr = n() ))

On its own this isn't very interesting, since the result is just the number of rows. It would be useful, however, in a table with a nested list-column where each cell holds a data frame with a different number of rows.

For example,

df %>% 
  mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>% 
  mutate_at(vars(contains("first")),  funs(second = . *2 )) %>%
  nest(-X1) %>%  
  mutate(mean = map(data, 
                  ~ summarise_at(.x, vars(contains("second")),
                                     funs(mean_second = mean(.) ))),
         n = map(data, 
                  ~ summarise_at(.x, vars(contains("second")),
                                     funs(n_second = n()  ))) ) %>%
  unnest(mean, n)

However I get the error:

Error in mutate_impl(.data, dots) : Evaluation error: Can't create call to non-callable object.

Why does the mean() function work in this context and n() does not?

Now, a couple of workarounds would be either:

n = map(data, ~ summarise_at(.x, vars(contains("second")),    
                                 funs(n_second = length(unique(.))  )))

but this breaks when identical values occur on different rows. Alternatively:

n = map(data, ~ nrow(.x)  )

but this does not let me build more complicated summarise_at() functions, which is what I'm really aiming for. Ultimately I'd like to do something like this to calculate standard errors:

se = map(data, ~ summarise_at(.x, vars(contains("second")),
                                         funs(se_second = sd(.)/sqrt(n())  )))

Why is n() not doing what I think it should do in this situation?

I suppose I could use `length()` but am curious as to what is going on with `n()` too. — G_T, Aug 26 '17 at 01:55
You can get a result using `rlang::expr( n() )`, but that returns the number of rows in the original dataset. It looks like it could be related to [this open dplyr issue](https://github.com/tidyverse/dplyr/issues/2080) — aosmith, Aug 28 '17 at 14:49

score 0 · Accepted Answer · answered Dec 20 '18 at 23:04

I believe aosmith's comment is correct, and this is an example of this issue:

#2080: Using n() in nested mutate()/summarize() calls gives unexpected results

The cause is dplyr's hybrid evaluation: dplyr recognizes certain R functions that it knows how to handle in its C++ code and replaces them. In this case, the replacement was too aggressive. Specifically, the outer mutate() replaced n() with the number 4 (there were 4 rows in the outer data frame after nesting, although each nested data frame had 6 rows). You can see this by running the following:

library(tidyverse)

set.seed(123)
df <- data_frame(X1 = rep(LETTERS[1:4], 6),
                 X2 = sort(rep(1:6, 4)),
                 ref = sample(1:50, 24),
                 sampl1 = sample(1:50, 24),
                 var2 = sample(1:50, 24),
                 meas3 = sample(1:50, 24))

df1 <- df %>% 
  mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>% 
  mutate_at(vars(contains("first")),  funs(second = . *2 )) %>% print %>% 
  nest(-X1)

debugonce(map)

df1 %>% mutate(n = map(data,
                       ~ summarize_at(.x,
                                      vars(contains("second")),
                                      funs(n_second = n()))))

In dplyr 0.7.8, this produces the message:

debugging in: map(data, ~summarize_at(.x, vars(contains("second")), funs(n_second = 4L)))

And of course funs(n_second = 4L) won't work because 4L isn't callable, so you get the error.

Perhaps more pernicious is if you had tried to fix it by doing something like this:

df1 %>% mutate(n = map(data,
                       ~ summarize_at(.x,
                                      vars(contains("second")),
                                      . %>% { n() }))) %>%
  unnest(n)

In dplyr 0.7.8 that runs without errors, but gives you the wrong answer: counts of 4 instead of 6, because it's using the number of rows in the outer data frame, rather than in the nested ones.
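The 4-versus-6 discrepancy can be checked independently of dplyr. The following is a minimal base R sketch, assuming only the X1/X2 layout of the example data; `pieces` is a name introduced here for illustration, and split() stands in for nest() rather than reproducing it.

```r
# Same grouping layout as the example data: 4 letters, each repeated 6 times
df <- data.frame(X1 = rep(LETTERS[1:4], 6),
                 X2 = sort(rep(1:6, 4)))

# One data frame per X1 value, analogous to the nested list-column
pieces <- split(df, df$X1)

# Number of groups, i.e. rows in the outer (nested) data frame
n_outer <- length(pieces)

# Number of rows inside each nested data frame
n_inner <- sapply(pieces, nrow)

n_outer
n_inner
```

Here n_outer is the value that the outer mutate() substituted for n(), while n_inner holds the per-group counts that the inner summarise call was expected to return.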

Luckily, all of this should be fixed in dplyr 0.8.0, due to this change:

#3526: hybrid all or nothing

With that change, the call to mutate wouldn't have replaced the n(), because it doesn't know how to replace the full expression containing that n() (and as we've seen, the surrounding expression can change the meaning of n()).

As for alternatives that work in earlier versions of dplyr, it seems to me that the calculations you were interested in could be achieved without nesting, by using group_by:

df %>% 
  mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>% 
  mutate_at(vars(contains("first")),  funs(second = . *2 )) %>%
  group_by(X1) %>%  
  summarise_at(vars(contains("second")),
               funs(mean_second = mean(.),
                    n_second = n(),
                    se_second = sd(.)/sqrt(n()) ))
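For readers on a newer dplyr, the same grouped summary can probably be written with across() instead of funs(). This is a sketch only, assuming dplyr 1.0.0 or later (across() and its `.names` argument are not available in the versions discussed above); the object name `res` is introduced here for illustration.

```r
library(dplyr)

set.seed(123)
df <- tibble(X1 = rep(LETTERS[1:4], 6),
             X2 = sort(rep(1:6, 4)),
             ref = sample(1:50, 24),
             sampl1 = sample(1:50, 24),
             var2 = sample(1:50, 24),
             meas3 = sample(1:50, 24))

res <- df %>%
  # Difference from ref, stored in new "<col>_first" columns
  mutate(across(-c(X1, X2, ref), ~ .x - ref, .names = "{.col}_first")) %>%
  # Double the "_first" columns, stored in new "<col>_second" columns
  mutate(across(contains("first"), ~ .x * 2, .names = "{.col}_second")) %>%
  group_by(X1) %>%
  # n() is evaluated per group here, so it counts rows within each X1 value
  summarise(across(contains("second"),
                   list(mean_second = mean,
                        n_second = ~ n(),
                        se_second = ~ sd(.x) / sqrt(n()))))

res
```

Because the summary runs directly on the grouped data frame, n() is never evaluated inside an outer mutate(), so the nesting problem described above does not arise.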

A small comment for future readers: `mean(.)` could also be written simply as `mean`, but the `(.)` form lets you pass further arguments, e.g. `mean(., na.rm = TRUE)`. Also, se_second can only be written with the `sd(.)` form, since it is part of a larger expression. — tjebo, Nov 04 '19 at 12:42
