plt.hist() vs np.histogram() - unexpected results

Question

The following lines

a1, b1, _ = plt.hist(df['y'], bins='auto')
a2, b2 = np.histogram(df['y'], bins='auto')

print(a1 == a2)
print(b1 == b2)

print True for every element: all values of a1 equal those of a2, and likewise for b1 and b2.
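The same comparison can be collapsed into a single boolean per array with np.array_equal. This is only a minimal sketch: the random sample y is a hypothetical stand-in for df['y'], and the Agg backend is selected purely so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(10, 3, size=1000)  # hypothetical stand-in for df['y']

a1, b1, _ = plt.hist(y, bins='auto')    # counts and bin edges from pyplot
a2, b2 = np.histogram(y, bins='auto')   # counts and bin edges from NumPy

# array_equal returns one bool instead of an element-wise array
same_counts = np.array_equal(a1, a2)
same_edges = np.array_equal(b1, b2)
print(same_counts, same_edges)
```

Note that the edge arrays have one more entry than the count arrays, because they hold both ends of every bin.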

I then create a plot using pyplot alone (with bins='auto', which should call the same np.histogram() function internally):

plt.hist(df['y'], bins='auto')
plt.show()

I then try to achieve the same histogram, but by calling np.histogram() myself, and passing the results into plt.hist(), but I get a blank histogram:

a2, b2 = np.histogram(df['y'], bins='auto')
plt.hist(a2, bins=b2)
plt.show()

From my understanding of how plt.hist(df['y'], bins='auto') works, the two plots should be exactly the same. Why isn't my approach using NumPy working?

EDIT

Following on from @MSeifert's answer below, I believe that for

counts, bins = np.histogram(df['y'], bins='auto')

bins is a list of the starting value for each bin, and counts is the corresponding number of values in each of these bins. As my histogram above shows, this should produce a nearly perfect normal distribution. However, if I call print(counts, bins), the result shows that the very first and last bins have quite a substantial count of ~11,000. Why isn't this reflected in the histogram? Why are there not two large spikes at either tail?

EDIT 2

It was just a resolution issue and my plot was seemingly too small for the spikes at either end to render correctly. Zooming in allowed them to display.

MSeifert · Accepted Answer · 2017-10-09T22:53:26.953


You're assuming that plt.hist can differentiate between an array containing counts as values and an array containing values to count.

However, that's not what happens. When you pass the counts to plt.hist, it treats them as raw data, counts them, and places them in the provided bins. That can lead to empty histograms, but also to weird ones.

So while plt.hist and numpy.histogram bin the data in the same way, you cannot just pass the counts obtained from numpy.histogram to plt.hist, because that would count the counts of the values (not what you expect):

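import numpy as np
import matplotlib.pyplot as plt

# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib notebook

f, ax = plt.subplots(1)
arr = np.random.normal(10, 3, size=1000)
# cnts holds the number of values per bin, bins holds the bin edges
cnts, bins = np.histogram(arr, bins='auto')
# Wrong: this histograms the counts themselves, not the original data
ax.hist(cnts, bins=bins)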
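To see in numbers why that plot comes out empty or odd, the second binning step can be reproduced with np.histogram alone. This is a rough sketch with a seeded sample (the seed and variable names are just illustrative choices); it relies on np.histogram ignoring values that fall outside the given edges.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins='auto')

# bins holds the bin edges, so it has one more entry than cnts
print(len(cnts), len(bins))

# Roughly what ax.hist(cnts, bins=bins) computes: the counts treated as data
recounted, _ = np.histogram(cnts, bins=bins)

# Counts that lie outside [bins[0], bins[-1]] are silently dropped
dropped = len(cnts) - recounted.sum()
print(recounted.sum(), dropped)
```

Any count larger than the largest data value never lands in a bin, which is why large counts can vanish from the plot entirely.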

However, you can use a bar plot to visualize histograms obtained by numpy.histogram:

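f, (ax1, ax2) = plt.subplots(2)
cnts, bins = np.histogram(arr, bins='auto')
# Bars centered on each bin, with height = count and width = bin width
ax1.bar(bins[:-1] + np.diff(bins) / 2, cnts, np.diff(bins))
# Reference: let hist bin the raw data itself
ax2.hist(arr, bins='auto')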
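Another option, if you want to stay with hist, is to pass one representative value per bin (here the left edges) and supply the precomputed counts through the weights parameter. The sketch below assumes a seeded sample and a headless backend, neither of which is part of the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
arr = rng.normal(10, 3, size=1000)
cnts, bins = np.histogram(arr, bins='auto')

fig, ax = plt.subplots()
# Each left edge falls into its own bin and is weighted by that bin's count
redrawn, edges, _ = ax.hist(bins[:-1], bins=bins, weights=cnts)

# The heights drawn by hist should match the counts from np.histogram
matches = np.allclose(redrawn, cnts)
print(matches)
```

On Matplotlib 3.4 or newer, ax.stairs(cnts, bins) should draw the same precomputed histogram as a step outline.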

edited Oct 09 '17 at 22:53

answered Oct 09 '17 at 22:47


Upon further inspecting my data, I have become unsure of another related aspect to my original question. Please see my edit. – KOB Oct 09 '17 at 23:14
I really don't know why these don't display when you plot the histogram (they do show up on my computer, so maybe it's a resolution thing). But I think that edit would be better as a new question. It would get more attention and you wouldn't mix several questions into one. Feel free to leave a link here, that sounds interesting and I would like to know the answer too :) – MSeifert Oct 09 '17 at 23:32
Ah yes, it is just a resolution issue. My plot was just initially too small, so when I zoom in or set it to fullscreen, the two spikes at the first and last bin show. Thanks. – KOB Oct 09 '17 at 23:38
