4

I have 1000 large numbers, randomly distributed in range 37231 to 56661.

I am trying to use the stats.gaussian_kde but something does not work. (maybe because of my poor knowledge of statistics?).

Here is the code:

from scipy import stats.gaussian_kde
import matplotlib.pyplot as plt

# 'data' is a 1D array that contains the initial numbers 37231 to 56661
xmin = min(data)
xmax = max(data)   

# get evenly distributed numbers for X axis.
x = linspace(xmin, xmax, 1000)   # get 1000 points on x axis
nPoints = len(x)

# get actual kernel density.
density = gaussian_kde(data)
y = density(x)

# print the output data
for i in range(nPoints):
    print "%s   %s" % (x[i], y[i])

plt.plot(x, density(x))
plt.show()

In the printout, I get x values in the column 1, and zeros in the column 2. The plot shows a flat line.

I simply can not find the solution. I tried a very wide range of X-es, the same result.

What is the problem? What am I doing wrong? Could the large numbers be the cause?

sarnold
  • 102,305
  • 22
  • 181
  • 238
Proteos
  • 83
  • 1
  • 8
  • Note formatting mistake near the top; you can select all the code and hit `{}` button to add the necessary four spaces before every line. – sarnold Mar 21 '12 at 23:38
  • @sarnold, I am sorry, which mistake do you mean? I actually used that {} button and at least on my Mac the formatting looks fine. (I am a newbie here and I apologize in advance for the mistake though) – Proteos Mar 23 '12 at 03:47
  • @Proteos: look at the first line, beginning "from scipy import..". It's not marked up as code. – DSM Mar 23 '12 at 18:56
  • Ha! That'll teach me to give advice without looking at the source code; you have to leave a blank line before source code. It's a bit silly, but you're right, the code was all there... – sarnold Mar 23 '12 at 23:05

2 Answers2

8

I think what's happening is that your data array is made up of integers, which leads to problems:

>>> import numpy, scipy.stats
>>> 
>>> data = numpy.random.randint(37231, 56661,size=10)
>>> xmin, xmax = min(data), max(data)
>>> x = numpy.linspace(xmin, xmax, 10)
>>> 
>>> density = scipy.stats.gaussian_kde(data)
>>> density.dataset
array([[52605, 45451, 46029, 40379, 48885, 41262, 39248, 38247, 55987,
        44019]])
>>> density(x)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

but if we use floats:

>>> density = scipy.stats.gaussian_kde(data*1.0)
>>> density.dataset
array([[ 52605.,  45451.,  46029.,  40379.,  48885.,  41262.,  39248.,
         38247.,  55987.,  44019.]])
>>> density(x)
array([  4.42201513e-05,   5.51130237e-05,   5.94470211e-05,
         5.78485526e-05,   5.21379448e-05,   4.43176188e-05,
         3.66725694e-05,   3.06297511e-05,   2.56191024e-05,
         2.01305127e-05])
DSM
  • 342,061
  • 65
  • 592
  • 494
  • Oh, what a naive mistake! I thought I was missing something simple but that simple? :) On the other hand, the gaussian_kde() function should have taken care of converting to float; at least give the warning that it needs floats. Dont' you agree? Well, The case is solved! Thank you very much! – Proteos Mar 23 '12 at 03:34
  • @Proteos: I agree, this looks like a bug to me. – DSM Mar 23 '12 at 03:43
  • Nice catch. I could see the argument go either way: the results are being returned with as many digits after the decimal as your input data, and `1e-05` is near enough to `0` for many uses, especially when the inputs are so far from `0`. This is definitely surprising though. – sarnold Mar 23 '12 at 23:07
  • This has been fixed in scipy and will convert to floats in the next release (scipy 0.10) – Josef Mar 25 '12 at 15:26
3

I have made a function to do this. You can vary the bandwidth as a parameter of the function. That is, smaller number = more pointy, larger number = smoother. The default is 0.3.

It works in IPython notebook --pylab=inline

The number of bins is optimized and coded so will vary on the number of variables in your data.

import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

def hist_with_kde(data, bandwidth = 0.3):
    #set number of bins using Freedman and Diaconis
    q1 = np.percentile(data,25)
    q3 = np.percentile(data,75)


    n = len(data)**(.1/.3)
    rng = max(data) - min(data)
    iqr = 2*(q3-q1)
    bins = int((n*rng)/iqr)

    x = np.linspace(min(data),max(data),200)

    kde = stats.gaussian_kde(data)
    kde.covariance_factor = lambda : bandwidth
    kde._compute_covariance()

    plt.plot(x,kde(x),'r') # distribution function
    plt.hist(data,bins=bins,normed=True) # histogram

data = np.random.randn(500)
hist_with_kde(data,0.25)
E. Zeytinci
  • 2,642
  • 1
  • 20
  • 37
John
  • 41,131
  • 31
  • 82
  • 106