In R, unexpected result from using group_by() and summarise() in dplyr

Question

I don't quite understand how some of the groupings and summaries are built in R using dplyr package.

With the reproducible example below I'm trying to first group by (PN,GOT,HID) to count distinct instances of PC1. I then regroup by (PN,GOT) to sum over the distinct instances of PC1, based on the second grouping. This process seems to work for the total sums, except that for mean(TC) I get the mean of the entire data frame when I would expect to see the means by groupings of (PN,GOT). What am i missing to get those means of (PN,GOT), while not losing the sums over PC1 that I've built? I would appreciate some explanation of where I'm going wrong here.

PN<- c("Mazda","Mazda","Datsun","Hornet","Hornet","Valiant","Duster","Merc","Merc","Merc","Merc","Merc",
       "Merc","Merc","Fiat","Honda","Toyota","Toyota","Dodge","AMC","Fiat")
GOT<- c("A","A","B","C","C","A","D","B","B","B","B","B","B","B","A","D","B","B","C","E","A")
HID<- c("Mazda_H1","Mazda_H1","Datsus_H1","Hornet_H1","Hornet_H2","Valiant_H1","Duster_H1","Merc_H1","Merc_H1","Merc_H1",
        "Merc_H2","Merc_H2","Merc_H3","Merc_H4","Fiat_H1","Honda_H1","Toyota_H1","Toyota_H2","Dodge_H1","AMC_H1","Fiat_H1")
PIC<- c("BB","BB","BB","BB","AA","AA","AA","BA","BA","BA",
        "AA","BB","BB","BB","BB","AA","AA","AA","BA","BA","BA")
TC <- c(110,110,93,175,175,105,245,62,62,62,62,62,62,62,33,52,97,97,150,150,33)
Int <- c(16.46,17.02,18.61,19.44,17.02,20.22,15.84,20.00,22.90,18.30,18.90,
         17.40,17.60,18.00,19.47,18.52,19.90,20.01,16.87,17.30,18.90)
PC1<- c("", "","G1","C1","","G1","", "G1","G1","C1","C1","","","","Z1","Z1","Z1","Z1","","","G1")

df<-data.frame(PN,GOT,HID,PIC,TC,Int,PC1)

df

df%>% filter(PC1!="") %>%
  group_by(PN, GOT, HID) %>%
  summarize(new = n_distinct(PC1)) %>%
  group_by(PN, GOT) %>%
  mutate(TOT_new = sum(new),
            meanTC = mean(TC))

I think the answer I'm looking for is something looking like this:

       PN    GOT        HID   TOT_new meanTC
   <fctr> <fctr>     <fctr>   <int>  <dbl>
1  Datsun      B  Datsus_H1     1     93
2    Fiat      A    Fiat_H1     2     33
3   Honda      D   Honda_H1     1     52
4  Hornet      C  Hornet_H1     1    175
5    Merc      B    Merc_H1     3     62
6  Toyota      B  Toyota_H1     2     97
7 Valiant      A Valiant_H1     1    105

or at least this:

       PN    GOT        HID   new TOT_new meanTC
   <fctr> <fctr>     <fctr> <int>   <int>  <dbl>
1  Datsun      B  Datsus_H1     1       1     93
2    Fiat      A    Fiat_H1     2       2     33
3   Honda      D   Honda_H1     1       1     52
4  Hornet      C  Hornet_H1     1       1    175
5    Merc      B    Merc_H1     2       3     62
6    Merc      B    Merc_H2     1       3     62
7  Toyota      B  Toyota_H1     1       2     97
8  Toyota      B  Toyota_H2     1       2     97
9 Valiant      A Valiant_H1     1       1    105

When you first `group_by(PN,GOT,HID)` and summarize, the non-grouped variables are lost. If you step through just the first few commands through `summarize`, you'll see what is happening. Perhaps you should do separate group/summary pipes and `left_join` the results? (It would help if you provide your expected output.) — r2evans, Feb 27 '17 at 03:43
@r2evans: i'll post an expected result shortly - thanks. Is there any way to keep, or call back, the lost variables for cases where one would like to go back and forth between different summaries? — val, Feb 27 '17 at 03:46
That's what I'm trying to find out. Not seeing anything, I'll stick by my first suggestion of multiple pipes (group/summarize `df` once, then do a completely separate group/summarize on `df` again, then join them together). — r2evans, Feb 27 '17 at 03:47

Psidom · Answer 1 · 2017-02-27T18:54:10.467

2

As commented by @r2evans, the reason you get a global mean is TC column is dropped during the first summarize stage. Besides the join option suggested in the comment, you can also pass the TC column info forward in the first summary stage by calculating two intermediate variables:

df %>% filter(PC1 != "") %>%

    group_by(PN, GOT, HID) %>%
    # create two columns with the sum and length of TC in each group which you can use later
    # for average calculation
    summarize(new = n_distinct(PC1), n = n(), TC_sum = sum(TC)) %>%

    group_by(PN, GOT) %>%
    summarise(TOT_new = sum(new), meanTC = sum(TC_sum)/sum(n))

# Source: local data frame [7 x 4]
# Groups: PN [?]

#       PN    GOT TOT_new meanTC
#   <fctr> <fctr>   <int>  <dbl>
#1  Datsun      B       1     93
#2    Fiat      A       2     33
#3   Honda      D       1     52
#4  Hornet      C       1    175
#5    Merc      B       3     62
#6  Toyota      B       2     97
#7 Valiant      A       1    105

edited Feb 27 '17 at 18:54

answered Feb 27 '17 at 03:59

Psidom

209,562
33
339
356

1

I like this better than a double-pipeline ... it works with `sum` and `sum`-related stats, though you might need alternate solutions when aggregates don't work as well (e.g., `median`). – r2evans Feb 27 '17 at 04:39
@r2evans Right, in that case. There's probably not an easy way to do this without a join. – Psidom Feb 27 '17 at 04:42
@Psidom: Shouldn't this be grouping by PN and GOT. Why would there be duplicates in those columns (E.g. Merc and Toyota) - those two should have been grouped into one with one value in each of TOT_new and meanTC. Seems to do that the mutate() needs to change to summarise(). – val Feb 27 '17 at 05:15
Yes. If you don't need the intermediate `new` column, it would be better to use summarize, and it makes the code shorter also. – Psidom Feb 27 '17 at 18:55

akrun · Answer 2 · 2017-02-27T05:32:20.517

We can also use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), specify the logical condition in 'i' (PC1 != ""), grouped by 'PN', 'GOT', 'HID', we get the length of uniqueelements of 'PC1 ('new'), number of elements per group (.N), and sum of 'TC', then grouped by 'PN', 'GOT', we assign the sum of 'new' and ratio of sum of 'TC_sum' with sum of 'n' to 'TOT_new' and 'meanTC'. Assign the columns that are not needed to NULL

library(data.table)
setDT(df)[PC1 != "", .(new = uniqueN(PC1), n = .N, TC_sum = sum(TC)) ,.(PN, GOT, HID)
       ][, c("TOT_new", "meanTC") := .(sum(new), sum(TC_sum)/sum(n)) ,.(PN, GOT)
         ][, c("n", "TC_sum") := NULL][]
#        PN GOT        HID new TOT_new meanTC
#1:  Datsun   B  Datsus_H1   1       1     93
#2:  Hornet   C  Hornet_H1   1       1    175
#3: Valiant   A Valiant_H1   1       1    105
#4:    Merc   B    Merc_H1   2       3     62
#5:    Merc   B    Merc_H2   1       3     62
#6:    Fiat   A    Fiat_H1   2       2     33
#7:   Honda   D   Honda_H1   1       1     52
#8:  Toyota   B  Toyota_H1   1       2     97
#9:  Toyota   B  Toyota_H2   1       2     97

could this method be used to bring back original columns of df that were not introduced right away in the beginning? Your explanation is clear to me, thanks. — val, Feb 27 '17 at 05:39
@val You can do a join with this dataset. If you notice the dataset, it is only having 9 rows where as the original dataset have more rows. So, it depends on which values you want to have it in the output — akrun, Feb 27 '17 at 05:44

val · Answer 3 · 2017-02-27T04:41:31.333

This is one way to make it work, based on some of the comments above. But it looks redundant.

df%>% filter(PC1!="") %>%
  group_by(PN, GOT, HID) %>%
  summarize(new = n_distinct(PC1),
            meanTC = mean(TC)) %>%
  group_by(PN, GOT) %>%
  mutate(TOT_new = sum(new),
         meanTC = mean(meanTC))%>%
  select(-HID)

       PN    GOT   new meanTC TOT_new
   <fctr> <fctr> <int>  <dbl>   <int>
1  Datsun      B     1     93       1
2    Fiat      A     2     33       2
3   Honda      D     1     52       1
4  Hornet      C     1    175       1
5    Merc      B     2     62       3
6    Merc      B     1     62       3
7  Toyota      B     1     97       2
8  Toyota      B     1     97       2
9 Valiant      A     1    105       1

In R, unexpected result from using group_by() and summarise() in dplyr

3 Answers3