Grouping by multiple dimensions, summarise and add calculated column

Question

I have this df:

  boxChange   sameCat
# C1 > C2     TRUE
# C1 > C2     TRUE
# A0 > A1     TRUE
# A1 > E4     FALSE
# C3 > E6     FALSE
# E0 > E3     TRUE
# ...         ...

I would like to group by both columns, count the occurrences and arrange by their number. By using dplyr I would go like this:

df2 <- df %>%  
group_by(boxChange, sameCat) %>%
summarise(occs = n()) %>%
arrange(desc(occs))

Obtaining:

  boxChange   sameCat   occs
# C1 > C2     TRUE      312
# A0 > A1     TRUE      189
# E0 > E3     TRUE      13
# C3 > E6     FALSE     123
# A1 > E4     FALSE     70

Now I would like to compute the percentage of each occs over the total and the cumulative percentage, obtaining something like this

  boxChange   sameCat   occs   perc   cump
# C1 > C2     TRUE      312    44      44
# A0 > A1     TRUE      189    27      71
# E0 > E3     TRUE       13     2      73
# C3 > E6     FALSE     123    17      90
# A1 > E4     FALSE      70    10     100

I tried with the following

df2 <- df %>%  
group_by(boxChange, sameCat) %>%
summarise(occs = n()) %>%
arrange(desc(occs)) %>%
mutate(perc = occs/sum(occs)*100) %>%
mutate(cump = cumsum(perc))

But the output is the following

  boxChange   sameCat   occs   perc   cump
# C1 > C2     TRUE      312    100     100
# A0 > A1     TRUE      189    100     100
# E0 > E3     TRUE       13    100     100
# C3 > E6     FALSE     123    100     100
# A1 > E4     FALSE      70    100     100

I cannot understand why it is like this and couldn't find any other thread reporting a similar issue. Do you have any insight?

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

1

We may need ungroup

df2 <- df %>%  
       group_by(boxChange, sameCat) %>%
        summarise(occs = n()) %>%
        arrange(desc(occs)) %>%
        ungroup %>%
        mutate(perc = occs/sum(occs)*100, 
               cump = cumsum(perc))

--

Or if we need to keep the grouping intact, use sum(.$occs)

Update

If we start from the OP's arraged 'occs'

df %>% 
  ungroup %>% 
  mutate(perc = round(occs/sum(occs) * 100),
         cump = cumsum(perc))
#   boxChange sameCat occs perc cump
#1   C1 > C2    TRUE  312   44   44
#2   A0 > A1    TRUE  189   27   71
#3   E0 > E3    TRUE   13    2   73
#4   C3 > E6   FALSE  123   17   90
#5   A1 > E4   FALSE   70   10  100

edited Jun 20 '20 at 09:12

Community

1
1

answered Jun 17 '19 at 21:01

akrun

874,273
37
540
662

Thank you! `sum(.$occs)` get the `perc` column right, though I keep getting the wrong cumulative sum, it's output exactly equal to `perc` – tuspazio Jun 17 '19 at 21:24
@tuspazio I started with the input example from 'df2' having the 'occs' column and have no difficultly in getting the 'cump' – akrun Jun 17 '19 at 21:33
1

Thanks again, don't know what I got wrong but now it works properly. Thanks again – tuspazio Jun 17 '19 at 21:43
@tuspazio If you have loaded `plyr`, then the `plyr::mutate` could mask the `dplyr::mutate` and leads to misleading results. – akrun Jun 17 '19 at 21:44
No 'plyer' is not loaded – tuspazio Jun 17 '19 at 21:45
ok, then could be some variables in the global env, which got removed on a fresh session and starts behaving as intended – akrun Jun 17 '19 at 21:46

Grouping by multiple dimensions, summarise and add calculated column

1 Answers1

Update