19

I am having trouble plotting a histogram as a pdf (probability)

I want the sum of all the pieces to equal an area of one so it's easier to compare across datasets. For some reason, whenever I specify the breaks (the default of 4 or whatever is terrible), it no longer wants to plot bins as a probability and instead plots bins as a frequency count.

hist(data[,1], freq = FALSE, xlim = c(-1,1), breaks = 800)

What should I change this line to? I need a probability distribution and a large number of bins. (I have 6 million data points)

This is in the R help, but I don't know how to override it:

freq logical; if TRUE, the histogram graphic is a representation of frequencies, the counts component of the result; if FALSE, probability densities, component density, are plotted (so that the histogram has a total area of one). Defaults to TRUE if and only if breaks are equidistant (and probability is not specified).

Thanks

edit: details

Hmm, so my plot goes above 1, which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin height should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0. As it stands now, I have bins that make a hump around 15.0.

edit: height by % of points in bin @Dwin: So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin; then that bin should be at 0.05 height. This is what I want. Am I doing it wrong, or is there another way this is done?

I know how many points I have. Is there a way to divide each bin count in the frequency histogram by this number?

Assad Ebrahim
SwimBikeRun
  • 3
    It's a DENSITY, not a probability. (To clarify: the fact that the total integral of f(x) is 1.0 does not imply that f(x) must be less than 1.0 at every x. The integral of f(x) over any range, finite or infinite, will be less than or equal to 1.0.) – IRTFM Jul 02 '13 at 04:16

6 Answers

46

To answer the request to plot probabilities rather than densities:

h <- hist(vec, breaks = 100, plot = FALSE)
h$counts <- h$counts / sum(h$counts)   # replace counts with proportions of the total
plot(h)
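
As a self-contained sketch of this approach (here `vec` is simulated data standing in for `data[,1]` from the question, and the labels are optional):

set.seed(1)
vec <- rnorm(1e5)                        # stand-in for data[,1]

h <- hist(vec, breaks = 100, plot = FALSE)
h$counts <- h$counts / sum(h$counts)     # each bar is now the share of points in its bin
sum(h$counts)                            # 1

plot(h, ylab = "Proportion of points", main = "Relative frequency histogram")

The y-axis then shows the fraction of all points falling in each bin, which is what the question asks for.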
IRTFM
  • AWESOME!! I didn't know you could throw the histogram into a variable like that then get at the counts. – SwimBikeRun Jul 02 '13 at 19:32
  • 1
    +1 Nice. The key point is that R does not produce *relative frequency* (probability) histograms out of the box. – Assad Ebrahim Jun 26 '14 at 14:50
  • 1
    However, if you've specified breaks yourself, especially non-uniform breaks, then R defaults to showing the DENSITY, not the COUNTS (frequencies). To fix this you need another line before plotting: `plot(h, freq=TRUE)`. Suggest adding this to your answer to make it fully general. – Assad Ebrahim Jun 26 '14 at 14:52
  • If I read your suggestion correctly, it appears to outline an alternate method. If it really does something useful, then perhaps you should compose your own answer that demonstrates its value. (At the moment it looks to me that it would not succeed.) – IRTFM Jun 26 '14 at 15:48
3

The default number of breaks is about log2(N) + 1 (Sturges' formula), which with N = 6 million in your case works out to roughly 24. If you're only seeing 4 breaks, that could be because you have xlim in your call. This doesn't change the underlying histogram; it only affects which part of it is plotted. If you do

h <- hist(data[,1], freq=FALSE, breaks=800)
sum(h$density * diff(h$breaks))

you should get a result of 1.
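
You can also ask R directly what the default would suggest, via the nclass.Sturges() helper that hist() uses for its default "Sturges" breaks:

nclass.Sturges(numeric(6e6))   # 24 suggested bins for 6 million points
ceiling(log2(6e6) + 1)         # the same formula written out

(hist() then rounds these to "pretty" break points, so the exact number of bins can differ slightly.)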


The density of your data is related to its units of measurement; therefore you want to make sure that "no bin height should be above 1.0" is actually meaningful. For example, suppose we have a bunch of measurements in feet. We plot the histogram of the measurements as a density. We then convert all the measurements to inches (by multiplying by 12) and do another density-histogram. The height of the density will be 1/12th of the original even though the data is essentially the same. Similarly, you could make your bin heights all less than 1 by multiplying all your numbers by 15.
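
A quick numerical sketch of this point, with made-up measurements (the names ft and inch are only for illustration):

set.seed(42)
ft <- rnorm(10000, mean = 5.5, sd = 0.3)   # hypothetical measurements in feet
inch <- ft * 12                            # the same data expressed in inches

h_ft <- hist(ft, breaks = 50, plot = FALSE)
h_in <- hist(inch, breaks = 50, plot = FALSE)

max(h_ft$density)                          # roughly 12 times larger than ...
max(h_in$density)                          # ... this, yet both histograms have total area 1:
sum(h_ft$density * diff(h_ft$breaks))      # 1
sum(h_in$density * diff(h_in$breaks))      # 1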

Does the value 1.0 have some kind of significance?

Hong Ooi
  • 1
    Yes, 1.0 is very meaningful. I want to look at a bin and see the % of points that are in that bin. The problem is that setting the breaks manually destroys the freq=FALSE part of the hist(): the part that would normally make it a % histogram for me. My different plots must all be on a probability scale or else the scales don't match up and it is impossible to compare them. – SwimBikeRun Jul 02 '13 at 03:46
2

Are you sure? This is working for me:

> vec <- rnorm(6000000)
> 
> h <- hist(vec, breaks = 800, freq = FALSE)
> sum(h$density)
[1] 100
> unique(zapsmall(diff(h$breaks)))
[1] 0.01

Multiply the last two results and you get 1, the total area of the density histogram. Remember that the bin width is important here.
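
Equivalently, you can fold the bin widths into the sum directly (same h as above):

sum(h$density * diff(h$breaks))   # 1 (the 100 above times the 0.01 bin width)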

This is with

> sessionInfo()
R version 3.0.1 RC (2013-05-11 r62732)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.0.1
Gavin Simpson
  • hmm so my plot goes above 1 which is quite confusing if it's a probability. I see how it has to do with the bin width now. I more or less want to make every bin worth 1 point while still having a lot of bins. In other words, no bin should be above 1.0 unless it is directly at 1.0 and all the other bins are 0.0 – SwimBikeRun Jul 02 '13 at 03:26
  • 2
    Wrong. It is not a probability. And this should probably be elevated to a FAQ considering the number of times it has been asked. – IRTFM Jul 02 '13 at 04:15
  • 1
    +1 for DWin's comment. What DWin calls density is often called *probability density*, or, perhaps more strictly, the density estimate is for the probability density *function* of the variable. If you want more information on this, read [Density Estimation](http://en.wikipedia.org/wiki/Density_estimation) and [Probability Density Function](http://en.wikipedia.org/wiki/Probability_density_function) on Wikipedia. – Gavin Simpson Jul 02 '13 at 14:58
  • @Dwin: So how do I plot the probability? I realize taking the integral will still give me 1.0 due to the units on the x axis, but this isn't what I want. Say I have 100 points and 5 of them fall into the first bin; then that bin should be at 0.05 height. This is what I want. Am I doing it wrong, or is there another way this is done? – SwimBikeRun Jul 02 '13 at 18:16
  • When dealing with continuous functions, you cannot talk about the "probability at x=2", since the integral from 2 to 2 will always be 0. You can only talk about probabilities over non-zero length intervals. For the cumulative probability function you can plot `cumsum(f(x))/sum(f(x))` evaluated at a sequence of particular x's of your choosing. For the probability in selected intervals you can use `cx <- cut(x, breaks)` and `table(cx)` and divide that table by `sum(table(cx))` (see the sketch after this comment thread). – IRTFM Jul 02 '13 at 18:30
  • I suppose you could do this with a `histogram`-object made with `freq=TRUE, plot=FALSE`, but divide the `density` list by `hist()$counts` before plotting. – IRTFM Jul 02 '13 at 18:38
  • I asked the wrong question, but your other answer is exactly what I wanted to do. The question I should have asked was "How do I calculate the bin probabilities and plot them?" – SwimBikeRun Jul 02 '13 at 19:35
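
A minimal sketch of the cut()/table() approach described in the comment above, with placeholder data and bin edges (the accepted answer reaches the same result through hist()):

set.seed(1)
x <- rnorm(100)                          # placeholder data
breaks <- seq(-4, 4, by = 0.5)           # placeholder bin edges covering range(x)

cx <- cut(x, breaks)                     # assign each point to an interval
bin_prob <- table(cx) / sum(table(cx))   # share of points in each interval
sum(bin_prob)                            # 1

barplot(bin_prob, las = 2, ylab = "Proportion of points")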
1
set.seed(0)

# Define a fair coin:
coin = c(1, 0)

# Toss the coin 10 times and count the number of heads; repeat the experiment 20000 times.
n = 20000   # Number of experiments
flips = 10  # Number of coin flips in each experiment

heads = colSums(replicate(n, sample(coin, flips, replace = TRUE))) # Count of heads in each experiment

# Use one unit-width bin centred on each possible outcome (0 to flips), i.e. flips + 1 bins,
# so that with freq = FALSE each bar height is the proportion of experiments with that count.
h = hist(heads, breaks = seq(-0.5, flips + 0.5, by = 1), freq = FALSE,
         border = FALSE, main = 'Histogram counts of heads',
         col = rgb(0.3, 0.8, 0.8, 0.6), ylab = 'Probability',
         xlab = 'No. of heads in 10 flips of a fair coin')

[Plot: probability histogram of head counts across the 20000 simulated experiments]
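
Because every bin has width 1, the densities here are the bin probabilities, which you can verify:

sum(h$density)   # 1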


If it helps anyone landing here, check this solution:

set.seed(0)

n = 1000
d = rnorm(n)

histogram = hist(d, breaks=10, prob=T, border=F)
unique(diff(histogram$breaks)) # The base of each rectangle is 0.5 wide, so the bar heights are double the true relative frequencies.

# The fix. Notice that I redefine the histogram simply to show how simple the call is with this fix.
h = hist(d, plot=F)
bp = barplot(h$counts/sum(h$counts), border=F)
axis(1, at=c(bp), labels=h$mids)
title(ylab="Relative Frequency")

Thanks to this answer.

Antoni Parellada
0

I observed that, in a histogram, density = relative frequency / corresponding bin width.

Example 1:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)
h2 = hist(nums, plot=F)
rf2 = h2$counts / sum(h2$counts)
d2 = rf2 / diff(h2$breaks)

h2$density
[1] 0.06 0.00 0.02 0.01 0.01

d2
[1] 0.06 0.00 0.02 0.01 0.01

Example 2:

nums = c(10, 41, 10, 28, 22,  8, 31,  3,  9,  9)
h3 = hist(nums, plot=F, breaks=c(1,30,40,50))
rf3 = h3$counts / sum(h3$counts)
d3 = rf3 / diff(h3$breaks)

h3$density
[1] 0.02758621 0.01000000 0.01000000

d3
[1] 0.02758621 0.01000000 0.01000000
camille
-1

R has a bug or something. If you have discrete data in a data.frame (with 1 column), and call hist(DF,freq=FALSE) on it, the relative densities will be wrong (summing to >1). This shouldn't happen as far as I can tell.

The solution is to call unlist() on the object first. This fixes the plot.

[Plots: the histogram before and after unlist()] (I changed the text too; data from http://www.electionstudies.org/studypages/anes_timeseries_2012/anes_timeseries_2012.htm)

CoderGuy123
  • 3
    I strongly suspect this is not a bug, but the fact that the bins have width < 1, so you need `sum(dens)*delta`, not just `sum(dens)` – Ben Bolker Mar 15 '15 at 20:54
  • e.g.: `x <- rep(1:10,1:10); h1 <- hist(x,freq=FALSE); sum(h1$density)` is 1. If you use `h2 <- hist(x,freq=FALSE,breaks=50)` then you need `sum(h2$density*diff(h2$mids)[1])` instead of `sum(h2$density)` – Ben Bolker Mar 15 '15 at 22:16
  • The width of the bins being <1 is precisely the cause of the problem, and part of the bug. I am using discrete data. – CoderGuy123 Mar 15 '15 at 22:29
  • 3
    I claim this is **not** a bug, it's a misunderstanding of `hist()` on your part. How about `prop.table(table(x))`? (There is a sketch of this at the end of the thread.) – Ben Bolker Mar 15 '15 at 22:51
  • Well then, why does it work properly with unlist() then? plot(prop.table(table(x))) works too without unlist() (also works with). Data and code here: http://emilkirkegaard.dk/en/?p=4928 – CoderGuy123 Mar 15 '15 at 23:04
  • 1
    because you have the `Hmisc` package loaded, which loads a separate `hist.data.frame` S3 method; it chooses the number of bins in a different way from `hist.default` in base R. (This is getting to the point where it should really be posed as a new question.) – Ben Bolker Mar 15 '15 at 23:55
  • Well, I use Hmisc about 100% of the time, so that explains why I always have to use unlist(). :) – CoderGuy123 Mar 16 '15 at 00:29
  • If you didn't you would get `Error in hist.default(d) : 'x' must be numeric` instead. – Ben Bolker Mar 16 '15 at 00:40
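
For genuinely discrete data, the prop.table(table(x)) route suggested above avoids binning altogether. A minimal sketch, reusing the example vector from the earlier comment:

x <- rep(1:10, 1:10)            # the discrete example from the comments above
p <- prop.table(table(x))       # probability of each distinct value
sum(p)                          # 1
plot(p, ylab = "Probability")   # one spike per value, heights summing to 1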