Plotting a histogram from pre-counted data in Matplotlib

Question

I'd like to use Matplotlib to plot a histogram over data that's been pre-counted. For example, say I have the raw data

data = [1, 2, 2, 3, 4, 5, 5, 5, 5, 6, 10]

Given this data, I can use

pylab.hist(data, bins=[...])

to plot a histogram.

In my case, the data has been pre-counted and is represented as a dictionary:

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

Ideally, I'd like to pass this pre-counted data to a histogram function that lets me control the bin widths, plot range, etc, as if I had passed it the raw data. As a workaround, I'm expanding my counts into the raw data:

from itertools import chain, repeat

data = list(chain.from_iterable(repeat(value, count)
            for (value, count) in counted_data.items()))

This is inefficient when counted_data contains counts for millions of data points.

Is there an easier way to use Matplotlib to produce a histogram from my pre-counted data?

Alternatively, if it's easiest to just bar-plot data that's been pre-binned, is there a convenience method to "roll-up" my per-item counts into binned counts?

As a side note: to expand your counts into raw data, you could also use the `Counter` class and its `elements()` method: `from collections import Counter`, `c = Counter(counted_data)`, `data = list(c.elements())` – Moncef M., Nov 27 '14 at 15:18

score 34 · Answer 1 · edited May 23 '17 at 12:09


You can use the weights keyword argument of np.histogram (which plt.hist calls underneath):

val, weight = zip(*[(k, v) for k,v in counted_data.items()])
plt.hist(val, weights=weight)

Assuming you only have integers as the keys, you can also use bar directly:

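# Iterating a dict yields its keys, so min/max work on Python 2 and 3
min_bin = min(counted_data)
max_bin = max(counted_data)

# One bar per integer between the smallest and largest key
bins = np.arange(min_bin, max_bin + 1)
vals = np.zeros(max_bin - min_bin + 1)

# Keys that are absent from counted_data keep a height of zero
for k, v in counted_data.items():
    vals[k - min_bin] = v

plt.bar(bins, vals, ...)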

where ... stands for whatever additional arguments you want to pass to bar (doc).

If you want to re-bin your data, see the question "Histogram with separate list denoting frequency".
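As a minimal sketch of the re-binning mentioned above, np.histogram with weights can roll per-item counts up into binned counts without expanding them into raw data. The bin edges below are made up for illustration and are not from the original question.

```python
import numpy as np

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}

values = np.array(list(counted_data.keys()))
counts = np.array(list(counted_data.values()))

# Hypothetical bin edges chosen for illustration: [0, 4), [4, 8), [8, 12]
edges = np.array([0, 4, 8, 12])

# Each value contributes its count (instead of 1) to the bin it falls into
binned, _ = np.histogram(values, bins=edges, weights=counts)
print(binned)
```

The resulting array has one entry per bin and can then be drawn with plt.bar, for example with align="edge" and width=np.diff(edges).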

edited May 23 '17 at 12:09 by Community
answered Oct 06 '13 at 18:58 by tacaswell

Thanks for the pointer to the `weights` option; I had overlooked it, but it solves my problem perfectly (see my answer). – Josh Rosen Oct 06 '13 at 22:27
I hadn't made that connection (got blinded by directly using `bar`). Edited to reflect your comment. – tacaswell Oct 06 '13 at 22:39

score 25 · Accepted Answer · answered Oct 06 '13 at 22:26


I used pyplot.hist's weights option to weight each key by its value, producing the histogram that I wanted:

pylab.hist(list(counted_data.keys()), weights=list(counted_data.values()), bins=range(50))

This allows me to rely on hist to re-bin my data.
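To check that the weighted call really matches a histogram of the expanded data, here is a small sketch that plots both side by side and keeps the returned counts. The bin edges and the Agg backend line are assumptions made so the snippet runs without a display; neither comes from the answer itself.

```python
import matplotlib
matplotlib.use("Agg")  # optional: non-interactive backend for headless runs
import matplotlib.pyplot as plt
from collections import Counter

counted_data = {1: 1, 2: 2, 3: 1, 4: 1, 5: 4, 6: 1, 10: 1}
bins = list(range(12))  # illustrative unit-width bins covering the data

fig, (ax_raw, ax_weighted) = plt.subplots(1, 2)

# Left: histogram of the fully expanded raw data
raw = list(Counter(counted_data).elements())
raw_counts, raw_edges, _ = ax_raw.hist(raw, bins=bins)

# Right: histogram of the distinct values, each weighted by its count
weighted_counts, weighted_edges, _ = ax_weighted.hist(
    list(counted_data.keys()),
    weights=list(counted_data.values()),
    bins=bins,
)
```

Both calls should return the same per-bin counts, while the weighted one only ever handles one entry per distinct value.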

answered Oct 06 '13 at 22:26 by Josh Rosen

and your way of getting the data out makes more sense than mine. It's fine with me if you accept your own answer. – tacaswell Oct 06 '13 at 22:50

This was the clue I needed. In my case I have a list of counts, and bin ranges: `plt.hist(bins, bins=len(bins), weights=counts)` was the invocation I needed – Ash Berlin-Taylor Nov 08 '17 at 17:18
Word of warning: I have noticed that this gives incorrect results if the bins have different sizes and `density=True` is used. This is probably not a bug, but rather a mathematical difference between pdf and cdf. – icemtel Nov 24 '20 at 10:08
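One possible reading of the warning above, offered as an assumption rather than a diagnosis: with density=True, each bin's weighted count is divided by the total count and by the bin's width, so bars over unequal bins are no longer proportional to the counts. A small NumPy sketch with made-up, unequal bin edges illustrates the relationship:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 10])
counts = np.array([1, 2, 1, 1, 4, 1, 1])

# Illustrative edges with unequal widths (2, 4 and 6)
edges = np.array([0.0, 2.0, 6.0, 12.0])
widths = np.diff(edges)

weighted, _ = np.histogram(values, bins=edges, weights=counts)
density, _ = np.histogram(values, bins=edges, weights=counts, density=True)

# The density heights are the weighted counts scaled by total and bin width
expected = weighted / (weighted.sum() * widths)
print(weighted, density, expected)
```

The bar areas (height times width) sum to 1, which is what a probability density requires, but the heights themselves differ from normalized counts whenever the widths differ.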

score 6 · Answer 3 · edited Jun 12 '21 at 18:44


You can also use seaborn to plot the histogram (note that distplot is deprecated as of seaborn 0.11; histplot accepts a weights argument directly):

import seaborn as sns

sns.distplot(
    list(counted_data.keys()),
    hist_kws={"weights": list(counted_data.values())},
)

edited Jun 12 '21 at 18:44 by macrocosme
answered Apr 16 '18 at 12:20 by youssef mhiri

score 4 · Answer 4 · answered Nov 07 '17 at 06:42

The "bins" array holds the bin edges, so it should be one element longer than "counts". Here's a way to fully reconstruct the histogram:

import numpy as np
import matplotlib.pyplot as plt
bins = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype(float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7]).astype(float)
centroids = (bins[1:] + bins[:-1]) / 2
counts_, bins_, _ = plt.hist(centroids, bins=len(counts),
                             weights=counts, range=(min(bins), max(bins)))
plt.show()
assert np.allclose(bins_, bins)
assert np.allclose(counts_, counts)
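As an alternative sketch for the same pre-binned input, Matplotlib 3.4 and later provide stairs, which takes the counts and the bin edges directly, so no centroids or weights are needed. The Agg backend line is only an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # optional: non-interactive backend for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Same edges and counts as in the answer above; edges has one more entry
edges = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
counts = np.array([5, 3, 4, 5, 6, 1, 3, 7], dtype=float)

fig, ax = plt.subplots()

# Draw the pre-binned histogram as a filled step outline
outline = ax.stairs(counts, edges, fill=True)
```

This draws the counts as given and does no re-binning, so it only fits when the existing bins are already the ones you want to show.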

score 0 · Answer 5 · answered May 08 '21 at 01:48

Adding to tacaswell's answer, plt.bar can be much more efficient than plt.hist here for large numbers of bins (>1e4). This is especially true for a crowded, random-looking plot where you only need to plot the highest bars, because the width required to see them will cover most of their neighbors anyway. You can pick out the highest bars and plot them with

i, = np.where(vals > min_height)
plt.bar(i, vals[i], width=len(bins) // 50)

For data with other statistical trends, it may be better to plot every 100th bar instead, or something similar.

The trick here is that plt.hist wants to plot all of your bins whereas plt.bar will let you just plot the sparser set of visible bins.
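A self-contained sketch of this idea follows. The random heights, the threshold and the bar width are all made-up values for illustration, and the Agg backend line is only there so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # optional: non-interactive backend for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical pre-binned data: 20,000 bins with random heights in [0, 1)
rng = np.random.default_rng(0)
vals = rng.random(20000)
min_height = 0.999  # illustrative cutoff: keep only the tallest bars

# Indices of the bins that are tall enough to be worth drawing
i, = np.where(vals > min_height)

fig, ax = plt.subplots()
# Wide bars so that the few remaining ones stay visible at this scale
bars = ax.bar(i, vals[i], width=len(vals) // 50)
```

Only the selected bars become rectangles in the figure, which is where the saving over drawing every bin comes from.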

score 0 · Answer 6 · answered Aug 07 '22 at 00:42


hist uses bar under the hood, so the following will produce something similar to what hist creates (it assumes bins of equal size):

bins = [1,2,3]
heights = [10,20,30]

ax = plt.gca()
ax.bar(bins, heights, align='center', width=bins[-1] - bins[-2])
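If the bins are not of equal size, one possible variation is to start from the bin edges, anchor each bar at its left edge, and pass per-bar widths. The edges and heights below are made up for illustration, and the Agg backend line is only an assumption for headless runs.

```python
import matplotlib
matplotlib.use("Agg")  # optional: non-interactive backend for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Illustrative bin edges with unequal widths, plus one height per bin
edges = np.array([0.0, 1.0, 3.0, 6.0])
heights = np.array([10, 20, 30])

fig, ax = plt.subplots()

# align="edge" places each bar's left side at the given x position,
# and np.diff(edges) makes each bar span exactly its own bin
bars = ax.bar(edges[:-1], heights, width=np.diff(edges), align="edge")
```

With this layout each bar covers the interval between two consecutive edges, so neighboring bars touch without overlapping.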

answered Aug 07 '22 at 00:42 by Eduardo
