Base R's hist() function uses the Sturges method by default to choose the number of bins, whereas ggplot2::geom_histogram defaults to 30 bins. There is a short tutorial showing how to replicate the Sturges method with ggplot2::geom_histogram:

https://r-charts.com/distribution/histogram-sturges-ggplot2/

The reprex is as follows, and works as expected:

# install.packages("ggplot2")
library(ggplot2)

# Data
set.seed(3)
x <- rnorm(450)
df <- data.frame(x)

# Calculating the Sturges bins
breaks <- pretty(range(x),
                 n = nclass.Sturges(x),
                 min.n = 1)
df$breaks <- breaks

# Histogram with Sturges method
ggplot(df, aes(x = x)) + 
  geom_histogram(color = 1, fill = "white",
                 breaks = breaks) +
  ggtitle("Sturges method") 

Created on 2022-09-01 by the reprex package (v2.0.1)

However, when I tried it on my own data, it failed. The reprex above seems to work only by chance: if the number of rows in the data frame is changed, the same error I got on my data arises:

# install.packages("ggplot2")
library(ggplot2)

# Data
set.seed(3)
x <- rnorm(400)
df <- data.frame(x)

# Calculating the Sturges bins
breaks <- pretty(range(x),
                 n = nclass.Sturges(x),
                 min.n = 1)
df$breaks <- breaks
#> Error in `$<-.data.frame`(`*tmp*`, breaks, value = c(-2.5, -2, -1.5, -1, : replacement has 14 rows, data has 400

Created on 2022-09-01 by the reprex package (v2.0.1)

How can I make this solution generalizable to all datasets, just like in the base R function?

Edit: I'm looking for an automated solution for use in a function, so I can't set the breaks manually.

rempsyc
  • You don't need the `df$breaks <- breaks` line at all. You can just remove it. You only need to use `breaks` in the `geom_histogram` – MrFlick Sep 01 '22 at 17:50
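A minimal sketch of what that comment suggests, wrapped in a function: the breaks are computed once and passed straight to `geom_histogram()`, and nothing is added to the data frame, so the number of rows no longer matters. The helper name `gg_hist_sturges` and its arguments are hypothetical, not part of ggplot2, and dropping non-finite values before computing the breaks is an assumption made here to keep `range()` from returning `NA`.

```r
library(ggplot2)

# Hypothetical helper (name and arguments are assumptions, not a ggplot2 API).
gg_hist_sturges <- function(data, var) {
  x <- data[[var]]
  x <- x[is.finite(x)]  # assumption: ignore NA/Inf when computing breaks
  # Same break calculation as in the tutorial code
  breaks <- pretty(range(x), n = nclass.Sturges(x), min.n = 1)
  # The breaks go to geom_histogram() only; no column is added to `data`
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(color = 1, fill = "white", breaks = breaks) +
    ggtitle("Sturges method")
}

# The size that failed in the question
set.seed(3)
df <- data.frame(x = rnorm(400))
p <- gg_hist_sturges(df, "x")
p
```

Because the breaks vector never has to fit into a column, its length is free to differ from `nrow(data)`.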

1 Answer

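In your first example you had 450 rows and 15 breaks, which fits perfectly: the 15 values are recycled 30 times to fill the column. In your second example the assignment fails because 400 is not a multiple of 14. Try two extra breaks instead, so that there are 16 breaks (400 / 16 = 25), like this: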

library(ggplot2)

# Data
set.seed(3)
x <- rnorm(400)
df <- data.frame(x)

# Calculating the Sturges bins
breaks <- pretty(range(x),
                 n = nclass.Sturges(x),
                 min.n = 1)
# Add two breaks
df$breaks <- c(breaks, 4.5, 5)

ggplot(df, aes(x = x)) + 
  geom_histogram(color = 1, fill = "white",
                 breaks = breaks) +
  ggtitle("Sturges method") 

Created on 2022-09-01 with reprex v2.0.2
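The divisibility point can be checked with base R alone. The sketch below uses made-up columns (`ok`, `bad`) on a 400-row data frame, purely for illustration: a shorter vector is recycled into a data frame column only when the number of rows is a multiple of its length.

```r
df <- data.frame(x = seq_len(400))

# 16 values recycle evenly into 400 rows (400 %% 16 == 0)
df$ok <- seq_len(16)

# 14 values do not (400 %% 14 == 8), so the assignment raises an error;
# tryCatch() captures the message instead of stopping the script
res <- tryCatch(
  {
    df$bad <- seq_len(14)
    "assigned"
  },
  error = function(e) conditionMessage(e)
)
res
```

This only explains why one data size happens to work and another does not; it says nothing about whether the column is needed for the plot.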

Quinten
  • There doesn't seem to be any point to the `df$breaks <- c(breaks, 4.5, 5)` line. That column isn't used by the plot at all. It doesn't seem necessary and just removing it also seems to work fine. – MrFlick Sep 01 '22 at 17:51
  • @MrFlick, Aah you are right! But why do they use that line in the tutorial? – Quinten Sep 01 '22 at 17:53
  • It just looks like a mistake. – MrFlick Sep 01 '22 at 17:55
  • I'm sorry, I'm looking for automated behaviour that mimics base R hist(), for use in a function, so I can't set breaks manually. @MrFlick, you got the real answer in your other comment, mind posting it as a formal answer to get the points? – rempsyc Sep 01 '22 at 18:27
  • I just closed as a duplicate that already had working code to do this: https://stackoverflow.com/questions/25146544/r-emulate-the-default-behavior-of-hist-with-ggplot2-for-bin-width. I guess the lesson is if the first tutorial doesn't work, try another. – MrFlick Sep 01 '22 at 18:29
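To see whether the `pretty()` call from the question really mimics base R, its breaks can be compared with the ones `hist()` reports for a few sample sizes. This is a rough check rather than a proof; `sturges_breaks` is a hypothetical helper name, and the sample sizes are arbitrary.

```r
# Hypothetical helper: the break calculation used in the question
sturges_breaks <- function(x) {
  x <- x[is.finite(x)]  # assumption: drop NA/Inf before computing the range
  pretty(range(x), n = nclass.Sturges(x), min.n = 1)
}

set.seed(3)
sizes <- c(50, 400, 450, 1000)

# For each size, compare against the breaks hist() computes without plotting
same <- vapply(sizes, function(n) {
  x <- rnorm(n)
  isTRUE(all.equal(sturges_breaks(x), hist(x, plot = FALSE)$breaks))
}, logical(1))
same
```

If every element of `same` is `TRUE`, passing `sturges_breaks(x)` to the `breaks` argument of `geom_histogram()` should give the same bin edges as `hist(x)` for those samples.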