7

How can I create a histogram that shows the probability distribution given an array of numbers x ranging from 0-1? I expect each bar to be <= 1 and that if I sum the y values of every bar they should add up to 1.

For example, if x=[.2, .2, .8] then I would expect a graph showing 2 bars, one at .2 with height .66, one at .8 with height .33.

I've tried:

matplotlib.pyplot.hist(x, bins=50, normed=True)

which gives me a histogram with bars that go above 1. I'm not saying that's wrong since that's what the normed parameter will do according to documentation, but that doesn't show the probabilities.

I've also tried:

counts, bins = numpy.histogram(x, bins=50, density=True)
bins = bins[:-1] + (bins[1] - bins[0])/2
matplotlib.pyplot.bar(bins, counts, 1.0/50)

which also gives me bars whose y values sum to greater than 1.

kmosley

2 Answers

6

I think my original terminology was off. I have an array of continuous values in [0, 1) that I want to discretize and plot as a probability mass function. I thought this would be common enough to warrant a single method to do it.

Here's the code:

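import random
import numpy as np
import matplotlib.pyplot as plt

x = [random.random() for _ in range(1000)]
num_bins = 50
counts, bins = np.histogram(x, bins=num_bins)
# Shift the left edges by half a bin width to get the bin centers
bins = bins[:-1] + (bins[1] - bins[0]) / 2
# Normalize the counts so that the bar heights sum to 1
probs = counts / float(counts.sum())
print(probs.sum())  # 1.0, up to floating-point rounding
plt.bar(bins, probs, 1.0 / num_bins)
plt.show()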
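
The same normalization can be pushed into the histogram call itself by weighting every sample by 1/N, so each bin holds a fraction of the samples instead of a count. This is a minimal sketch, assuming the data lie in [0, 1]; the helper names `probability_histogram` and `plot_probability_histogram` and the `value_range` default are made up for illustration and are not part of NumPy or matplotlib:

```python
import numpy as np


def probability_histogram(x, num_bins=50, value_range=(0.0, 1.0)):
    """Return (probs, edges); probs[i] is the fraction of samples in bin i.

    Hypothetical helper. Samples outside value_range are dropped by
    np.histogram, so the fractions sum to 1 only if all data lie inside it.
    """
    x = np.asarray(x, dtype=float)
    # A weight of 1/N per sample turns counts into fractions.
    weights = np.full(x.shape, 1.0 / x.size)
    probs, edges = np.histogram(x, bins=num_bins, range=value_range,
                                weights=weights)
    return probs, edges


def plot_probability_histogram(x, num_bins=50, value_range=(0.0, 1.0)):
    """Draw the same thing directly with plt.hist (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    x = np.asarray(x, dtype=float)
    weights = np.full(x.shape, 1.0 / x.size)
    plt.hist(x, bins=num_bins, range=value_range, weights=weights)
    plt.ylabel("fraction of samples")
    plt.show()


probs, edges = probability_histogram([0.2, 0.2, 0.8])
print(probs.sum())  # sums to 1 up to floating-point rounding
```

Calling `plot_probability_histogram([0.2, 0.2, 0.8])` should draw two bars, one holding two thirds of the samples and one holding one third.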
kmosley
3

I think you are mistaking a sum for an integral. A proper PDF (probability density function) integrates to unity; if you simply sum the bar heights, you leave out the width of each rectangle.

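import numpy as np
import matplotlib.pyplot as plt

N = 10**5
X = np.random.normal(size=N)

# density=True scales the counts so that the histogram integrates to 1
counts, bins = np.histogram(X, bins=50, density=True)
# Shift the left edges by half a bin width to get the bin centers
bins = bins[:-1] + (bins[1] - bins[0]) / 2

# Trapezoidal integration of the density over the bin centers
# (np.trapz is deprecated in NumPy 2.0 in favor of np.trapezoid)
print(np.trapz(counts, bins))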

This prints roughly 0.999985 (the exact value varies with the random sample), which is close enough to unity.
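
To make the sum-versus-integral point concrete, multiplying each density value by its bin width gives the probability mass of that bin, and those masses do sum to 1. The sketch below assumes nothing beyond NumPy; the fixed seed and the variable names `density`, `widths`, and `bin_probs` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
X = rng.normal(size=10**5)

# density=True returns bar heights whose total *area* is 1.
density, edges = np.histogram(X, bins=50, density=True)
widths = np.diff(edges)  # width of every bin

# Height times width is the probability mass that falls in each bin.
bin_probs = density * widths

print(density.sum())    # sum of heights, not 1 in general
print(bin_probs.sum())  # sum of areas, 1 up to floating-point rounding
```

Plotting `bin_probs` instead of `density` gives bars that are each at most 1 and that add up to 1.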

EDIT: In response to the comment below:

If x=[.2, .2, .8] and I'm looking for a graph with two bars, one at .2 with height .66 because 66% of the values are at .2 and one bar at .8 with height .33, what would that graph be called and how do I generate it?

The following code:

from collections import Counter
x = [.2, .2, .8]
C = Counter(x)
total = float(sum(C.values()))
# Convert each count into a relative frequency
for key in C:
    C[key] /= total

This gives a "dictionary" C = Counter({0.2: 0.666666, 0.8: 0.333333}). From here one could construct a bar graph, but this only works if the distribution is discrete and takes a finite set of values that are well separated from each other.
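
One way to turn those relative frequencies into the bar graph is sketched below. The function names `relative_frequencies` and `plot_relative_frequencies` and the `bar_width` default are hypothetical, and the sketch assumes the values lie in [0, 1] and repeat exactly, as in the example:

```python
from collections import Counter


def relative_frequencies(values):
    """Map each distinct value to the fraction of times it occurs."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sorting keeps the bars in left-to-right order along the x axis.
    return {value: count / total for value, count in sorted(counts.items())}


def plot_relative_frequencies(values, bar_width=0.05):
    """Draw one bar per distinct value (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed to plot

    freqs = relative_frequencies(values)
    plt.bar(list(freqs.keys()), list(freqs.values()), width=bar_width)
    plt.xlim(0, 1)  # assumes the data lie in [0, 1]
    plt.ylabel("fraction of values")
    plt.show()


freqs = relative_frequencies([.2, .2, .8])
print(freqs)
```

Calling `plot_relative_frequencies([.2, .2, .8])` should show a bar of height 2/3 at 0.2 and a bar of height 1/3 at 0.8.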

Asherah
Hooked
  • Perhaps my terminology is off. If x=[.2, .2, .8] and I'm looking for a graph with two bars, one at .2 with height .66 because 66% of the values are at .2 and one bar at .8 with height .33, what would that graph be called and how do I generate it? – kmosley Oct 22 '13 at 00:58
  • What is the source of your data? Is it coming from a continuous signal or is it set of discrete events? – Hooked Oct 22 '13 at 03:42
  • It is a continuous signal which I would like to discretize so that I can look at the bar chart and say "values around .2 occur roughly x% of the time". – kmosley Oct 22 '13 at 18:19
  • If the values are continuous you'll need to bin them somehow which is what `np.histogram` does. As I said, I think that your problem lies with your interpretation of the bins. They do not need to sum up to one, they need to integrate to one. – Hooked Oct 22 '13 at 19:30