2

I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.

Consider the hp column in mtcars. I want to bin it by intervals of 40, to get the desired output:

# desired output
# +────────+────────+────────+
# | lower  | upper  | count  |
# +────────+────────+────────+
# | 40     | 80     | 5      |
# | 80     | 120    | 10     |
# | 120    | 160    | 4      |
# | 160    | 200    | 6      |
# | 200    | 240    | 3      |
# | 240    | 280    | 3      |
# | 280    | 320    | 0      |
# | 320    | 360    | 1      |
# +────────+────────+────────+

So far, I have the current code:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric()))

#> # A tibble: 7 x 3
#>   lower upper     n
#>   <dbl> <dbl> <int>
#> 1    40    80     5
#> 2    80   120    10
#> 3   120   160     4
#> 4   160   200     6
#> 5   200   240     3
#> 6   240   280     3  \ ❌ there's one bin missing here! ❌
#> 7   320   360     1  / the bin of 280-320 has 0 values and was implicitly removed

Since ggplot2::cut_interval() is just a wrapper around cut(), I can pass arguments to base::cut.default via .... My question is whether there's a simple trick to force showing bins with 0 values.

Emman
  • 3,695
  • 2
  • 20
  • 44

3 Answers3

1

One way is to use complete and complete the sequence, i.e (continuing your work)

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric())) |>
  complete(nesting(lower = seq(min(lower), max(lower), by = 40), 
                   upper = seq(min(upper), max(upper), by = 40))) |>
  mutate(n = replace_na(n, 0))

# A tibble: 8 × 3
  lower upper     n
  <dbl> <dbl> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1
Sotos
  • 51,121
  • 6
  • 32
  • 66
1

You miss .drop = FALSE in group_by(). It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |> 
  tally() |>
  extract(hp_interval, into = c("lower", "upper"), regex = "(\\d+),(\\d+)", convert = TRUE)

# A tibble: 8 × 3
  lower upper     n
  <int> <int> <int>
1    40    80     5
2    80   120    10
3   120   160     4
4   160   200     6
5   200   240     3
6   240   280     3
7   280   320     0
8   320   360     1
Darren Tsai
  • 32,117
  • 5
  • 21
  • 51
1

and then there's hist

bin_counts <- hist(mtcars$hp, breaks=c(40,80,120,160,200,240,280,320,360), plot=FALSE)
str(bin_counts)
List of 6
 $ breaks  : num [1:9] 40 80 120 160 200 240 280 320 360
 $ counts  : int [1:8] 5 10 4 6 3 3 0 1
 $ density : num [1:8] 0.00391 0.00781 0.00313 0.00469 0.00234 ...
 $ mids    : num [1:8] 60 100 140 180 220 260 300 340
 $ xname   : chr "mtcars$hp"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
Chris
  • 1,647
  • 1
  • 18
  • 25