How to bin an integer column with group_by() |> tally() and show all bins, even if no values in them?

Question

I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.

Consider the hp column in mtcars. I want to bin it by intervals of 40, to get the desired output:

# desired output
# +────────+────────+────────+
# | lower  | upper  | count  |
# +────────+────────+────────+
# | 40     | 80     | 5      |
# | 80     | 120    | 10     |
# | 120    | 160    | 4      |
# | 160    | 200    | 6      |
# | 200    | 240    | 3      |
# | 240    | 280    | 3      |
# | 280    | 320    | 0      |
# | 320    | 360    | 1      |
# +────────+────────+────────+

So far, I have the current code:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric()))

#> # A tibble: 7 x 3
#>   lower upper     n
#>   <dbl> <dbl> <int>
#> 1    40    80     5
#> 2    80   120    10
#> 3   120   160     4
#> 4   160   200     6
#> 5   200   240     3
#> 6   240   280     3  \ ❌ there's one bin missing here! ❌
#> 7   320   360     1  / the bin of 280-320 has 0 values and was implicitly removed

Since ggplot2::cut_interval() is just a wrapper around cut(), I can pass arguments to base::cut.default via .... My question is whether there's a simple trick to force showing bins with 0 values.

score 1 · Answer 1 · answered Dec 28 '22 at 14:59

One way is to use complete and complete the sequence, i.e (continuing your work)

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric())) |>
  complete(nesting(lower = seq(min(lower), max(lower), by = 40), 
                   upper = seq(min(upper), max(upper), by = 40))) |>
  mutate(n = replace_na(n, 0))

# A tibble: 8 × 3
  lower upper     n
  <dbl> <dbl> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1

Darren Tsai · Accepted Answer · 2022-12-28T15:09:04.907

You miss .drop = FALSE in group_by(). It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |> 
  tally() |>
  extract(hp_interval, into = c("lower", "upper"), regex = "(\\d+),(\\d+)", convert = TRUE)

# A tibble: 8 × 3
  lower upper     n
  <int> <int> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1

score 1 · Answer 3 · answered Dec 28 '22 at 21:32

and then there's hist

bin_counts <- hist(mtcars$hp, breaks=c(40,80,120,160,200,240,280,320,360), plot=FALSE)
str(bin_counts)
List of 6
 $ breaks  : num [1:9] 40 80 120 160 200 240 280 320 360
 $ counts  : int [1:8] 5 10 4 6 3 3 0 1
 $ density : num [1:8] 0.00391 0.00781 0.00313 0.00469 0.00234 ...
 $ mids    : num [1:8] 60 100 140 180 220 260 300 340
 $ xname   : chr "mtcars$hp"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"

How to bin an integer column with group_by() |> tally() and show all bins, even if no values in them?

3 Answers3