R hist() and qplot(geom="histogram") are inconsistent

Asked May 17 '17 at 17:36

Active May 17 '17 at 17:36

Viewed 108 times

I am experimenting with some simulated p-value distributions.

When plotting my data with hist() the result looks as expected. The p-values are uniformly distributed and there is a peak close to zero (the 'signal').

library(readr)    # provides read_tsv()
library(ggplot2)  # provides qplot()
data = read_tsv("./data_vector.tsv")
hist(data$p_values, breaks = seq(0, 1, by=1/30))

However, when doing the same with ggplot/qplot, the peak at the left is missing:

qplot(data$p_values, geom="histogram", bins=30)

What did I get wrong? I would have expected the two commands to be equivalent.

Data & Code: My input data and an R Markdown report are available from this gist.

asked May 17 '17 at 17:36

Gregor Sturm


Just specify the `breaks=` to `geom_histogram()` as well: `ggplot(data, aes(x=p_values)) + geom_histogram(bins=30, breaks = seq(0, 1, by=1/30))` – MrFlick May 17 '17 at 17:40
It works... but how is that different from setting bins=30? Or rather: what happens to my peak on the left when using the default parameters? – Gregor Sturm May 17 '17 at 17:41
The short answer is that histogram binning is way more subtle & complicated than seems reasonable. Sort of like calculating a quantile. (Read `?quantile`.) – joran May 17 '17 at 17:49
This is what ggplot uses as bins by default with your data: `ggplot2:::bin_breaks_bins(range(data$p_values), 30)`. It doesn't know the min/max, so it splits up the lowest bin, with part of it falling outside the 0/1 range. The right edge of the first bin is at 0.03333 for your hist, and 0.017227 for ggplot. – MrFlick May 17 '17 at 17:51
Thanks, that's the explanation I was looking for! – Gregor Sturm May 17 '17 at 17:56
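
To make the explanation in the comments concrete, here is a minimal base-R sketch of the two binning schemes. It rests on assumptions: `default_like_breaks()` is a hypothetical helper that only approximates what `ggplot2:::bin_breaks_bins()` appears to do when just `bins` is given (bin width = range / (bins - 1), first bin centred on the smallest value), and `p` is simulated stand-in data, not the vector from the gist.

```r
# Hypothetical helper (an assumption, not ggplot2's actual code):
# width = range / (bins - 1), first bin centred on the smallest value.
default_like_breaks <- function(x_range, bins = 30) {
  width <- diff(x_range) / (bins - 1)
  seq(from = x_range[1] - width / 2, by = width, length.out = bins + 1)
}

# Simulated stand-in for the p-values: uniform background plus a peak near 0.
set.seed(1)
p <- c(runif(900), rbeta(100, 1, 50))

hist_breaks <- seq(0, 1, by = 1 / 30)                    # as in the hist() call
gg_breaks   <- default_like_breaks(range(p), bins = 30)  # ggplot-like default

# First bin edges: hist() starts at 0, the ggplot-like scheme starts below 0,
# so only about half of its first bin overlaps the data range.
hist_breaks[1:2]
gg_breaks[1:2]

# Counts per bin under both schemes (no plotting needed).
counts_hist <- hist(p, breaks = hist_breaks, plot = FALSE)$counts
counts_gg   <- hist(p, breaks = gg_breaks,   plot = FALSE)$counts

# Compare the height of the leftmost bar.
counts_hist[1]
counts_gg[1]
```

Because the first ggplot-like bin straddles the minimum, it covers only about half as much of the data range as the first `hist()` bin, which would explain why the peak near zero looks much smaller or missing. Passing explicit `breaks`, as suggested in the first comment, sidesteps the difference.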
