I want to number each record of a data frame/tibble that results from a grouping, with the index following a defined order. If I use row_number(), it does enumerate, but only within each group. I want it to enumerate without considering the earlier grouping.

Here is an example. To make it simple I used the most minimal dataframe:

library(dplyr)

df0 <- data.frame( x1 = rep(LETTERS[1:2],each=2)
                 , x2 = rep(letters[1:2], 2)
                 , y = floor(abs(rnorm(4)*10))
)
df0
#   x1 x2  y
# 1  A  a 12
# 2  A  b 24
# 3  B  a  0
# 4  B  b 12

Now, I group this table:

 df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y))

This gives me an object of class tibble:

 # A tibble: 4 x 3
 # Groups:   x1 [?]
 #   x1    x2        y
 #   <fct> <fct> <dbl>
 # 1 A     a        12
 # 2 A     b        24
 # 3 B     a         0
 # 4 B     b        12

I want to add a row number to this table using row_number():

 df2 <- df1 %>% arrange(desc(y)) %>% mutate(index = row_number())
 df2
 # A tibble: 4 x 4
 # Groups:   x1 [2]
 #   x1    x2        y index
 #   <fct> <fct> <dbl> <int>
 # 1 A     b        24     1
 # 2 A     a        12     2
 # 3 B     b        12     1
 # 4 B     a         0     2

row_number() enumerates within the former grouping, which was not my intention. This can be avoided by converting the tibble to a data frame first:

 df2 <- df2 %>% as.data.frame() %>% arrange(desc(y)) %>% mutate(index = row_number())
 df2
 #   x1 x2  y index
 # 1  A  b 24     1
 # 2  A  a 12     2
 # 3  B  b 12     3
 # 4  B  a  0     4

My question is: is this behaviour intended? If yes, isn't it dangerous to carry earlier data processing along in a tibble? Which kinds of processing are carried along? For now I will convert the tibble to a data frame to avoid this kind of unexpected result.

giordano
  • What about adding `ungroup() %>%` before `mutate(index = row_number())`? – tmfmnk Oct 11 '18 at 14:17
  • Yes, the behavior is intended, since the grouping from the previous step is still active. You need to `ungroup()` first. Try `df1 %>% ungroup() %>% arrange(desc(y)) %>% mutate(index = row_number())`, or do `df1 <- df0 %>% group_by(x1,x2) %>% summarize(y=sum(y)) %>% ungroup()`. – Ronak Shah Oct 11 '18 at 14:18
  • I wouldn't consider it dangerous to retain grouping; I often have multiple steps in a workflow that all build upon the same grouping. It's more a matter of knowing that that's what will happen and acting accordingly, i.e. calling `ungroup` when you don't want groups anymore. – camille Oct 11 '18 at 14:24
  • You can save a line of code by replacing `ungroup()` with `summarize(.groups = "drop")`, which returns an `ungroup`ed `tibble`. – Dan Adams Feb 07 '22 at 00:55

2 Answers

To elaborate on my comment: yes, retaining grouping is intended, and in many cases useful. It's only dangerous if you don't understand how group_by works—and that's true of any function. To undo group_by, you call ungroup.

Take a look at the group_by docs, as they're very thorough and explain how this function interacts with others, how grouping is layered, etc. The docs also explain how each call to summarise removes a layer of grouping—it might be there that you got confused about what's going on.
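
As a minimal sketch of that layering (not part of the original answer), the grouping can be inspected at each step with dplyr's group_vars(). The data frame df below is a stand-in that hard-codes the values printed for df0 in the question, and the intermediate names grouped, summarised and ungrouped are introduced here only for illustration. Recent dplyr versions may also print an informational message about `.groups` on the summarise() call.

```r
library(dplyr)

# Stand-in for df0, using the values printed in the question
# (hard-coded so the result does not depend on rnorm()).
df <- data.frame(
  x1 = rep(c("A", "B"), each = 2),
  x2 = rep(c("a", "b"), times = 2),
  y  = c(12, 24, 0, 12)
)

# Two layers of grouping after group_by().
grouped <- df %>% group_by(x1, x2)
group_vars(grouped)
#> [1] "x1" "x2"

# summarise() peels off the last layer (x2); x1 is still active.
summarised <- grouped %>% summarise(y = sum(y))
group_vars(summarised)
#> [1] "x1"

# ungroup() removes whatever grouping is left.
ungrouped <- summarised %>% ungroup()
group_vars(ungrouped)
#> character(0)
```

Checking group_vars() before a mutate() is a cheap way to confirm whether row_number() will count within groups or across the whole table.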

For example, you can group by x1 and x2, summarize y, and create a row number, which numbers the rows within x1 (summarise removes one layer of grouping, i.e. it drops the x2 grouping). Ungrouping then lets you get row numbers across the entire data frame.

library(dplyr)

df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y)) %>%
  mutate(group_row = row_number()) %>%
  ungroup() %>%
  mutate(all_df_row = row_number())
#> # A tibble: 4 x 5
#>   x1    x2        y group_row all_df_row
#>   <fct> <fct> <dbl>     <int>      <int>
#> 1 A     a        12         1          1
#> 2 A     b         2         2          2
#> 3 B     a        10         1          3
#> 4 B     b        23         2          4

A use case (I do this for work probably every day) is to get sums within multiple groups (again, x1 and x2), then use mutate to find the shares of those values within their larger group (after peeling away a layer of grouping, this is x1). Here again I ungroup, this time to show the shares within the entire data frame as well.

df0 %>%
  group_by(x1, x2) %>%
  summarise(y = sum(y)) %>%
  mutate(share_in_group = y / sum(y)) %>%
  ungroup() %>%
  mutate(share_all_df = y / sum(y))
#> # A tibble: 4 x 5
#>   x1    x2        y share_in_group share_all_df
#>   <fct> <fct> <dbl>          <dbl>        <dbl>
#> 1 A     a        12          0.857       0.255 
#> 2 A     b         2          0.143       0.0426
#> 3 B     a        10          0.303       0.213 
#> 4 B     b        23          0.697       0.489

Created on 2018-10-11 by the reprex package (v0.2.1)

camille

As camille nicely showed, there are good reasons for wanting to have the result of summarize() retain additional layers of grouping and it's a documented behaviour so not really dangerous or unexpected per se.

However, one additional tip: if you are just going to call ungroup() after summarize(), you might as well use summarize(.groups = "drop"), which returns an ungrouped tibble and saves you a line of code.

library(tidyverse)

df0 <- data.frame(
  x1 = rep(LETTERS[1:2], each = 2),
  x2 = rep(letters[1:2], 2),
  y = floor(abs(rnorm(4) * 10))
)

df0 %>% 
  group_by(x1,x2) %>% 
  summarize(y=sum(y), .groups = "drop") %>% 
  arrange(desc(y)) %>% 
  mutate(index = row_number())
#> # A tibble: 4 x 4
#>   x1    x2        y index
#>   <chr> <chr> <dbl> <int>
#> 1 A     b         8     1
#> 2 A     a         2     2
#> 3 B     a         2     3
#> 4 B     b         1     4

Created on 2022-02-06 by the reprex package (v2.0.1)
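
For comparison, here is a hedged sketch of how the other `.groups` values affect the grouping of the result. It assumes dplyr 1.0.0 or later (where the `.groups` argument is available) and uses a stand-in data frame df with the values printed in the question; the names by_both, g_drop_last, g_keep, g_drop and res are made up for this illustration.

```r
library(dplyr)

# Stand-in data with fixed values (taken from the question's printed df0).
df <- data.frame(
  x1 = rep(c("A", "B"), each = 2),
  x2 = rep(c("a", "b"), times = 2),
  y  = c(12, 24, 0, 12)
)

by_both <- df %>% group_by(x1, x2)

# "drop_last": remove only the last grouping variable (x2).
g_drop_last <- group_vars(summarise(by_both, y = sum(y), .groups = "drop_last"))
g_drop_last
#> [1] "x1"

# "keep": keep the same grouping as the input.
g_keep <- group_vars(summarise(by_both, y = sum(y), .groups = "keep"))
g_keep
#> [1] "x1" "x2"

# "drop": return an ungrouped tibble.
g_drop <- group_vars(summarise(by_both, y = sum(y), .groups = "drop"))
g_drop
#> character(0)

# With "drop", row_number() counts across the whole table.
res <- by_both %>%
  summarise(y = sum(y), .groups = "drop") %>%
  arrange(desc(y)) %>%
  mutate(index = row_number())
res$index
#> [1] 1 2 3 4
```

Which option to pick depends on what the next step needs: "drop" suits a final table, while "drop_last" is convenient when a following mutate() should still work within x1.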

Dan Adams