48

I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data

data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]

Given this data, I can use

pylab.hist(data, bins=[...])

to plot a histogram.

In my case, the data has been pre-counted and is represented as a dictionary:

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc., as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:

from itertools import chain, repeat

data = list(chain.from_iterable(repeat(value, count)
            for (value, count) in counted_data.items()))

This is inefficient when counted_data contains counts for millions of data points.

Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?

Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?

scrpy
Josh Rosen
  • As a sidenote: To expand your counts into raw data, you could also use the `Counter` class and its elements() method: `from collections import Counter` `c = Counter(counted_data)` `data = list(c.elements())` – Moncef M. Nov 27 '14 at 15:18

6 Answers

34

You can use the weights keyword argument to np.histogram (which plt.hist calls underneath):

# Unzip {value: count} into parallel tuples of values and counts.
val, weight = zip(*counted_data.items())
plt.hist(val, weights=weight)
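As a quick check that weights does what is wanted, here is a minimal sketch that calls np.histogram directly. The bin edges are an arbitrary choice for illustration, not something from the question. It rolls the per-value counts up into binned counts and compares the result with the histogram of the expanded raw data.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

# Arbitrary bin edges chosen for illustration: [0, 3), [3, 6), [6, 9), [9, 12].
edges = [0, 3, 6, 9, 12]

values = list(counted_data.keys())
counts = list(counted_data.values())

# Each value contributes its count to the bin it falls in.
binned, _ = np.histogram(values, bins=edges, weights=counts)

# The same histogram computed from the expanded raw data, for comparison.
raw = np.repeat(values, counts)
expected, _ = np.histogram(raw, bins=edges)

assert np.array_equal(binned, expected)
```

The binned array can then be passed to plt.bar if you prefer to draw the bars yourself.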

Assuming you only have integers as the keys, you can also use bar directly:

min_bin = min(counted_data)
max_bin = max(counted_data)

# One bar per integer in [min_bin, max_bin]; missing keys keep a height of 0.
bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)

for k, v in counted_data.items():
    vals[k - min_bin] = v

plt.bar(bins, vals, ...)

where ... stands for whatever arguments you want to pass to bar (see the bar documentation).

If you want to re-bin your data, see "Histogram with separate list denoting frequency".

tacaswell
  • Thanks for the pointer to the `weights` option; I had overlooked it, but it solves my problem perfectly (see my answer). – Josh Rosen Oct 06 '13 at 22:27
  • I hadn't made that connection (got blinded by directly using `bar`). Edited to reflect your comment. – tacaswell Oct 06 '13 at 22:39
25

I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:

pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))

This allows me to rely on hist to re-bin my data.

Josh Rosen
  • and your way of getting the data out makes more sense than mine. It's fine with me if you accept your own answer. – tacaswell Oct 06 '13 at 22:50
  • This was the clue I needed. In my case I have a list of counts, and bin ranges: `plt.hist(bins, bins=len(bins), weights=counts)` was the invocation I needed – Ash Berlin-Taylor Nov 08 '17 at 17:18
  • Word of warning: I have noticed that this gives incorrect result if bins have different size, and `density=True` is used. Probably not a bug, rather a mathematical difference between pdf and cdf. – icemtel Nov 24 '20 at 10:08
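The density=True behaviour mentioned in the last comment can be illustrated with a minimal sketch. It assumes the usual definition of a density histogram: each bar's height is its count divided by the total count times the bin width, so the bar areas (not the heights) sum to 1. The unequal bin edges below are made up for illustration.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
values = list(counted_data.keys())
counts = list(counted_data.values())

# Hypothetical unequal bins with widths 2, 2 and 8.
edges = np.array([0.0, 2.0, 4.0, 12.0])

binned, _ = np.histogram(values, bins=edges, weights=counts)
density, _ = np.histogram(values, bins=edges, weights=counts, density=True)

# Heights are counts / (total * width), so the areas sum to 1.
widths = np.diff(edges)
assert np.allclose(density, binned / (binned.sum() * widths))
assert np.isclose((density * widths).sum(), 1.0)
```

In this example the widest bin holds the most counts but does not get the tallest bar, which is likely the effect the comment describes.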
6

You can also use seaborn to plot the histogram (note that distplot is deprecated as of seaborn 0.11):

import seaborn as sns

sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)
macrocosme
youssef mhiri
4

The "bins" array holds the bin edges, so it should be one element longer than "counts". Here's a way to fully reconstruct the histogram:

import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
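If the bins are not equally wide, the bins=len(counts) and range=... combination above no longer reproduces the edges. A minimal sketch of one alternative, assuming you already hold the edge array, is to pass the edges themselves as bins. The edge and count values here are invented for illustration. np.histogram is used so the result can be checked without drawing anything; plt.hist accepts the same bins and weights arguments.

```python
import numpy as np

# Invented pre-binned data with unequal bin widths.
edges = np.array([0.0, 1.0, 2.0, 5.0, 10.0])
counts = np.array([4.0, 2.0, 9.0, 3.0])

# One representative point per bin; any point strictly inside the bin works.
centers = (edges[:-1] + edges[1:]) / 2

rebuilt, rebuilt_edges = np.histogram(centers, bins=edges, weights=counts)

assert np.allclose(rebuilt, counts)
assert np.allclose(rebuilt_edges, edges)
```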
R. Yang
0

Adding to tacaswell's comment, plt.bar can be much more efficient than plt.hist here for large numbers of bins (>1e4). This is especially true for a crowded random plot, where you only need to plot the highest bars because the width required to see them will cover most of their neighbors anyway. You can pick out the highest bars and plot them with

i, = np.where(vals > min_height)
plt.bar(i,vals[i],width=len(bins)//50)

For data with other statistical trends, it may work better to plot every 100th bar or something similar.

The trick here is that plt.hist wants to plot all of your bins whereas plt.bar will let you just plot the sparser set of visible bins.
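Both selection strategies can be sketched with plain NumPy indexing. The heights and the threshold below are made up for illustration; either index array can then be passed to plt.bar as in the snippet above.

```python
import numpy as np

# Made-up heights for 20,000 integer bins (values cycle through 0..6).
vals = np.arange(20_000) % 7

# Option 1: keep only the bars above a (hypothetical) threshold.
min_height = 5
tall = np.flatnonzero(vals > min_height)

# Option 2: keep every 100th bar.
every_100th = np.arange(0, vals.size, 100)

# Either index array selects the x positions and heights to draw,
# e.g. plt.bar(tall, vals[tall], width=...).
tall_heights = vals[tall]
sampled_heights = vals[every_100th]
```

Thinning the bars this way changes what the plot shows, so it is a display shortcut and not a substitute for re-binning.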

Max
0

hist uses bar under the hood, so this will produce something similar to what hist creates (it assumes bins of equal size):

bins = [1,2,3]
heights = [10,20,30]

ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])
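For bins of unequal width, a sketch along the same lines is to give bar the left edges with align='edge' and take the per-bar widths from np.diff. The edge and height values are made up, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up bin edges (one more than the number of bars) and bar heights.
edges = np.array([1.0, 2.0, 4.0, 8.0])
heights = np.array([10, 20, 30])

fig, ax = plt.subplots()
# Each bar starts at its left edge and spans the distance to the next edge.
bars = ax.bar(edges[:-1], heights, width=np.diff(edges), align="edge")
```

With align='edge' the bars tile the x axis between consecutive edges, which is close to how hist draws them.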
Eduardo