
I am experimenting with some simulated p-value distributions.

When plotting my data with hist() the result looks as expected. The p-values are uniformly distributed and there is a peak close to zero (the 'signal').

library(readr)  # provides read_tsv()
data <- read_tsv("./data_vector.tsv")
hist(data$p_values, breaks = seq(0, 1, by = 1/30))


However, when doing the same with ggplot/qplot, the peak at the left is missing:

library(ggplot2)  # provides qplot()
qplot(data$p_values, geom = "histogram", bins = 30)


What did I get wrong? I would have expected the two commands to be equivalent.

Data & Code: My input data and an R Markdown report are available from this gist

Gregor Sturm
  • Just specify the `breaks=` to `geom_histogram()` as well: `ggplot(data, aes(x=p_values)) + geom_histogram(bins=30, breaks = seq(0, 1, by=1/30))` – MrFlick May 17 '17 at 17:40
  • It works... but how is that different from setting bins=30? Or rather: what happens to my peak on the left when using the default parameters? – Gregor Sturm May 17 '17 at 17:41
  • The short answer is that histogram binning is way more subtle & complicated than seems reasonable. Sort of like calculating a quantile. (Read `?quantile`.) – joran May 17 '17 at 17:49
  • This is what ggplot uses as bin breaks by default with your data: `ggplot2:::bin_breaks_bins(range(data$p_values), 30)`. It doesn't know the theoretical min/max of the data, so it centers the lowest bin on the observed minimum, which pushes part of that bin outside the 0/1 range. The right edge of the first bin is at 0.03333 for your hist() call, but at 0.017227 for ggplot. – MrFlick May 17 '17 at 17:51
  • Thanks, that's the explanation I was looking for! – Gregor Sturm May 17 '17 at 17:56
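
To make the explanation in the comments concrete, here is a minimal base-R sketch of the two binning schemes. It rests on assumptions: `p` is simulated stand-in data (not the data from the gist), and the `gg_breaks` formula is only my reading of what ggplot2 does by default for `bins = 30` (bin width = range / (bins - 1), first bin centered on the observed minimum). The exact internals may differ between ggplot2 versions.

```r
set.seed(1)

# Hypothetical stand-in for data$p_values:
# uniform "null" p-values plus a signal concentrated near zero
p <- c(runif(900), rbeta(100, 1, 50))

bins <- 30

# Breaks as passed to hist() in the question: 30 bins that tile [0, 1] exactly
hist_breaks <- seq(0, 1, by = 1 / bins)

# Assumed ggplot2 default for `bins = 30`: the width is derived from the
# observed range, and the first and last bins are centered on min and max
rng <- range(p)
width <- diff(rng) / (bins - 1)
gg_breaks <- rng[1] - width / 2 + width * (0:bins)

# Count observations per bin under both schemes, without plotting
hist_counts <- hist(p, breaks = hist_breaks, plot = FALSE)$counts
gg_counts   <- hist(p, breaks = gg_breaks, plot = FALSE)$counts

# Right edge of the first bin and number of points in it, for each scheme
c(hist = hist_breaks[2], ggplot_like = gg_breaks[2])
c(hist = hist_counts[1], ggplot_like = gg_counts[1])
```

Under these assumptions, the first ggplot-style bin covers only about half as much of [0, 1] as the first hist() bin, so it catches fewer of the small p-values and the peak looks flattened. Besides passing `breaks=` explicitly, pinning the bins with something like `geom_histogram(binwidth = 1/30, boundary = 0)` should give an equivalent alignment.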
