
Just to note, I have already checked this question and this question.

So, I'm using distplot to draw some histograms on separate subplots:

import numpy as np
#import netCDF4 as nc # used to get p0_dict
import matplotlib.pyplot as plt
from collections import OrderedDict
import seaborn.apionly as sns
import cPickle as pickle

''' 
LINK TO PICKLE
https://drive.google.com/file/d/0B8Xks3meeDq0aTFYcTZEZGFFVk0/view?usp=sharing
'''

p0_dict = pickle.load(open('/path/to/pickle/test.dat', 'r'))     

fig = plt.figure(figsize = (15,10))
ax = plt.gca()
j=1

for region, val in p0_dict.iteritems():

    val = np.asarray(val)

    subax = plt.subplot(5,5,j)

    print region

    try:              
        sns.distplot(val, bins=11, hist=True, kde=True, rug=True, 
                     ax = subax, color = 'k', norm_hist=True)

    except Exception as Ex:
        print Ex

    subax.set_title(region)
    subax.set_xlim(0, 1) # the data varies from 0 to 1

    j+=1    

plt.subplots_adjust(left = 0.06, right = 0.99, bottom = 0.07,
                    top = 0.92, wspace = 0.14, hspace = 0.6) 

fig.text(0.5, 0.02, r'$ P(W) = 0,1 $', ha ='center', fontsize = 15)
fig.text(0.02, 0.5, '% occurrence', ha ='center', 
         rotation='vertical', fontsize = 15) 
# obviously I'd multiply the fractional ticklabels by 100 to get 
# the percentage...

plt.show()

What I expect is for the area under the KDE curve to sum to 1, and for the y axis ticklabels to reflect this. However, I get the following:

[Figure: grid of distplot subplots, one per region, with norm_hist and kde turned on]

As you can see, the y axis ticklabels are not in the range [0,1], as would be expected. Turning on/off norm_hist or kde does not change this. For reference, the output with both turned off:

[Figure: the same subplots with norm_hist and kde turned off]

Just to verify:

aus = np.asarray(p0_dict['AUS'])
aus_bins = np.histogram(aus, bins=11)[0]

plt.subplot(121)
plt.hist(aus,11)
plt.subplot(122)
plt.bar(range(0,11),aus_bins.astype(float)/np.sum(aus_bins))

plt.show()

[Figure: plt.hist of the AUS data (left) and the manually normalised bar chart (right)]

The y ticklabels in this case properly reflect those of a normalised histogram.

What am I doing wrong?

Thank you for your help.

areuexperienced

1 Answer


The y axis is a density, not a probability. I think you are expecting the normalized histogram to show a probability mass function, where the sum of the bar heights equals 1. But that's wrong; the normalization ensures that the sum of the bar heights times the bar widths equals 1. This is what ensures that the normalized histogram is comparable to the kernel density estimate, which is normalized so that the area under the curve is equal to 1.
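To make the distinction concrete, here is a minimal sketch using NumPy only. The data is synthetic (a Beta sample confined to [0, 1], an assumption standing in for the pickled `p0_dict` values), and all variable names are illustrative rather than taken from the question.

```python
import numpy as np

# Synthetic stand-in for one region's data (assumption): values in [0, 1].
rng = np.random.default_rng(0)
data = rng.beta(2.0, 5.0, size=1000)

# density=True rescales the bar heights so that the total AREA is 1.
density, edges = np.histogram(data, bins=11, range=(0.0, 1.0), density=True)
widths = np.diff(edges)

area = np.sum(density * widths)   # heights times widths: equals 1
height_sum = np.sum(density)      # heights alone: about 11, since each width is 1/11

# Probability mass per bin: multiply each density by its bin width.
prob = density * widths

print(area, height_sum, density.max(), prob.sum())
```

Because each bin is only 1/11 wide, individual bar heights can be well above 1 while the area still integrates to 1. Multiplying by the bin width recovers the per-bin probabilities that the manual `plt.bar` check in the question displays.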

mwaskom
  • Thanks for the clarification @mwaskom. I suppose I could add a separate axis to represent the probability as well? Sorry if this seems a bit confused, I'm not very familiar with KDE as a technique. – areuexperienced Sep 01 '15 at 10:02
  • This doesn't have anything to do with KDE per se. I would read up on the distinction between probability and density; it is a confusing issue and one that comes up very often on the stats version of this site (e.g. [here](http://stats.stackexchange.com/questions/4220/can-a-probability-distribution-value-exceeding-1-be-ok)), although I'm sure there are good explanations elsewhere too. – mwaskom Sep 01 '15 at 14:39
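On the question of a separate axis for probability raised in the comments: one possible approach, sketched here under the assumption of equal-width bins, is a twin y-axis whose limits are the density limits scaled by the bin width. The data is again synthetic and the names `ax_density`, `ax_prob`, and `bin_width` are illustrative; plain matplotlib is used in place of `distplot` to keep the sketch self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for one region's data (assumption): values in [0, 1].
rng = np.random.default_rng(0)
data = rng.beta(2.0, 5.0, size=1000)

n_bins = 11
bin_width = 1.0 / n_bins  # equal-width bins over the range (0, 1)

fig, ax_density = plt.subplots()
heights, edges, _ = ax_density.hist(data, bins=n_bins, range=(0.0, 1.0),
                                    density=True, color="0.6")
ax_density.set_ylabel("density")

# Second y-axis sharing the same x-axis: probability = density * bin width.
ax_prob = ax_density.twinx()
y_min, y_max = ax_density.get_ylim()
ax_prob.set_ylim(y_min * bin_width, y_max * bin_width)
ax_prob.set_ylabel("probability per bin")

# Call fig.savefig(...) or, with an interactive backend, plt.show() to view it.
```

This rescaling only works when every bin has the same width; with unequal bins there is no single factor that converts the density axis into a probability axis.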