dplyr bug with data.table backend [in dplyr 0.4.3 or earlier]

Question

While browsing the answers here, I found that this solution works exactly as expected with a data.frame.

library(dplyr) # dplyr_0.4.3
library(data.table) # data.table_1.9.5
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
                     a = c("AA", "AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"),
                     b = c(2L, 5L, 1L, 2L, 4L, 4L, 3L, 1L, 4L)),
                .Names = c("id", "a", "b"),
                class = "data.frame", row.names = c(NA, -9L))


df %>%
  group_by(id) %>%
  mutate(relevance=+(a!='AA')) %>%
  mutate(mean=cumsum(relevance * b) / cumsum(relevance))

Source: local data frame [9 x 5]
Groups: id [3]

     id     a     b relevance  mean
  (int) (chr) (int)     (int) (dbl)
1     1    AA     2         0   NaN
2     1    AB     5         1   5.0
3     1    AA     1         0   5.0
4     2    AB     2         1   2.0
5     2    AB     4         1   3.0
6     3    AB     4         1   4.0
7     3    AB     3         1   3.5
8     3    AA     1         0   3.5
9     3    AA     4         0   3.5

However, when run on a data.table, the same code gives a result I cannot explain.

setDT(df) %>%
  group_by(id) %>%
  mutate(relevance=+(a!='AA')) %>%
  mutate(mean=cumsum(relevance * b) / cumsum(relevance))

Source: local data table [9 x 5]

     id     a     b relevance     mean
  (int) (chr) (int)     (int)    (dbl)
1     1    AA     2         0      NaN
2     1    AB     5         1 5.000000
3     1    AA     1         0 5.000000
4     2    AB     2         1 3.500000
5     2    AB     4         1 3.666667
6     3    AB     4         1 3.750000
7     3    AB     3         1 3.600000
8     3    AA     1         0 3.600000
9     3    AA     4         0 3.600000

Is this expected behaviour? If so, are there any guidelines on when not to use the data.table backend with dplyr?

I think you don't need two `mutate` calls here: `setDT(df) %>% group_by(id) %>% mutate(relevance=+(a!='AA'), Mean= cumsum(relevance*b)/cumsum(relevance))` works as expected. I think what is happening is that after the first `mutate` the grouping is gone for some strange reason, so the second one uses an ungrouped `cumsum`. – akrun, Sep 14 '15 at 16:22
Looking at row 5, shouldn't `cumsum(relevance * b) / cumsum(relevance) = ([4*1] + [2*1] + [5*1])/(3) = 11/3 = 3.666667`, i.e. the `data.table` answer? – Akhil Nair, Sep 14 '15 at 16:28
@akrun Thanks for the comment. I want to know how a seemingly innocuous extra `mutate` generates such behavior. – ExperimenteR, Sep 14 '15 at 16:28
It may be a bug that messes up the grouping. In general, a second `mutate` is not needed. – akrun, Sep 14 '15 at 16:29
Oh right, I see. It seems to have forgotten about the grouping. – Akhil Nair, Sep 14 '15 at 16:29
@Frank, Thank you very much. You should post that as an answer as akrun suggested. – ExperimenteR, Sep 14 '15 at 16:47
@ExperimenteR Okay thanks, done. Maybe the shortest answer I've ever written. – Frank, Sep 14 '15 at 16:57
@Frank, thanks again for the important lesson: when something goes wrong, check the GitHub issues first. – ExperimenteR, Sep 14 '15 at 17:07
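
The dropped-grouping explanation in the comments can be checked without dplyr or data.table. The following is a minimal base R sketch, not the code path either package uses: it rebuilds the question's data, then computes the running mean once with cumulative sums restarted per `id` (via `ave`) and once with cumulative sums over all rows. The names `grouped` and `ungrouped` are introduced here for illustration only.

```r
# Data from the question, rebuilt as a plain data.frame.
df <- data.frame(
  id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  a  = c("AA", "AB", "AA", "AB", "AB", "AB", "AB", "AA", "AA"),
  b  = c(2L, 5L, 1L, 2L, 4L, 4L, 3L, 1L, 4L),
  stringsAsFactors = FALSE
)
relevance <- +(df$a != "AA")

# Grouped: cumulative sums restart within each id.
grouped <- ave(relevance * df$b, df$id, FUN = cumsum) /
  ave(relevance, df$id, FUN = cumsum)

# Ungrouped: cumulative sums run over all nine rows.
ungrouped <- cumsum(relevance * df$b) / cumsum(relevance)

print(round(cbind(id = df$id, grouped, ungrouped), 6))
```

The `grouped` column reproduces the data.frame output in the question and the `ungrouped` column reproduces the data.table output, which is consistent with the second `mutate` having run without the grouping.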

Frank · Accepted Answer · 2016-09-02T11:22:47.590

4

The bug that caused grouping to be dropped after `mutate` on a data.table was fixed in dplyr 0.5.0.
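
To tell whether a given dplyr version predates the fix, a version comparison is enough. This is a small base R sketch; `predates_fix` is a hypothetical helper written for this example and not part of dplyr, and the only version numbers used are the two named in this thread.

```r
# Versions named in this thread: 0.4.3 shows the bug, 0.5.0 has the fix.
fixed_in <- package_version("0.5.0")

# Hypothetical helper (not part of dplyr): TRUE if a version predates the fix.
predates_fix <- function(version) package_version(version) < fixed_in

predates_fix("0.4.3")  # TRUE
predates_fix("0.5.0")  # FALSE
```

For the installed package, one could pass `as.character(packageVersion("dplyr"))` to the helper. On an affected version, the single-`mutate` form suggested in the comments avoids the problem.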

edited Sep 02 '16 at 11:22

answered Sep 14 '15 at 16:56


It might be worth saying that the bug has been closed. – Zag Sep 02 '16 at 08:11
@Zag Thanks. Edited the question and answer. – Frank Sep 02 '16 at 11:23
