p_value is 0 when I use scipy.stats.kstest() for large dataset

Question

I have a series of unique values with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function only takes a list of observations, I expanded the frequencies into a list before passing it to the function. However, the result is weird: pvalue=0.0.

The histogram of the original data and my code are below. [Image: Histogram of my dataset]

[In]: frequencies = mp[['c','v']]

[In]: print frequencies
         c      v
31  3475.8   18.0
30  3475.6   12.0
29  3475.4   13.0
28  3475.2    8.0
20  3475.0   49.0
14  3474.8   69.0
13  3474.6   79.0
12  3474.4   78.0
11  3474.2   78.0
7   3474.0  151.0
6   3473.8  157.0
5   3473.6  129.0
2   3473.4  149.0
1   3473.2  162.0
0   3473.0  179.0
3   3472.8  145.0
4   3472.6  139.0
8   3472.4   95.0
9   3472.2  103.0
10  3472.0  125.0
15  3471.8   56.0
16  3471.6   75.0
17  3471.4   70.0
18  3471.2   70.0
19  3471.0   57.0
21  3470.8   36.0
22  3470.6   22.0
23  3470.4   20.0
24  3470.2   12.0
25  3470.0   23.0
26  3469.8   13.0
27  3469.6   17.0
32  3469.4    6.0

[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)

[In]: testData = list(itertools.chain.from_iterable(testData))

[In]: print len(testData)
2415

[In]: print np.unique(testData)
[ 3469.4  3469.6  3469.8  3470.   3470.2  3470.4  3470.6  3470.8  3471.
  3471.2  3471.4  3471.6  3471.8  3472.   3472.2  3472.4  3472.6  3472.8
  3473.   3473.2  3473.4  3473.6  3473.8  3474.   3474.2  3474.4  3474.6
  3474.8  3475.   3475.2  3475.4  3475.6  3475.8]

[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)

Thanks in advance, everyone.

James · Accepted Answer · 2017-10-11T10:50:53.457

8

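Passing 'norm' as the second argument checks your data against scipy.stats.norm.cdf with its default parameters, loc=0 and scale=1, that is, the standard normal distribution. Your values are all around 3470, where the standard normal CDF is effectively 1, which is why you get statistic=1.0 and pvalue=0.0.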
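To see why the default parameters give that result, here is a minimal sketch that rebuilds the sample from the frequency table in the question and evaluates the standard normal CDF at the smallest observation. The names `values`, `counts`, and `cdf_at_min` are mine, chosen for illustration.

```python
import numpy as np
from scipy.stats import kstest, norm

# Frequency table from the question, in ascending order of value
values = np.round(3469.4 + 0.2 * np.arange(33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70, 75, 56,
                   125, 103, 95, 139, 145, 179, 162, 149, 129, 157, 151,
                   78, 78, 79, 69, 49, 8, 13, 12, 18])
data = np.repeat(values, counts)  # 2415 observations

# The standard normal CDF has already saturated at the smallest observation
cdf_at_min = norm.cdf(data.min())

# Testing against N(0, 1): the empirical CDF starts at 0 where the
# reference CDF is already 1, so the maximum distance is 1
result = kstest(data, 'norm')
print(cdf_at_min, result.statistic, result.pvalue)
```

Every observation sits thousands of standard deviations above 0, so the test is comparing your data to a distribution it could not plausibly have come from.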

Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.

import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt

freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
    [3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
    [3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
    [3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
    [3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
    [3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
    [3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]

data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)

Plot the fit of the norm to the data:

plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
plt.plot(x, 350*n.pdf(x))
plt.show()

This is not a terribly good fit, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the CDF of the fitted normal distribution:

kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)

So we still reject the null hypothesis that the data were drawn from the fitted distribution.
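If you prefer not to build a frozen distribution, the same test can be written in a couple of other ways. The sketch below rebuilds the sample from the frequency table and compares three forms that should agree: a frozen distribution's cdf, the `args` parameter of kstest, and standardizing the data first. The variable names are mine. Keep in mind that because loc and scale are estimated from the same data, the reported p-value is only approximate.

```python
import numpy as np
from scipy.stats import kstest, norm

# Frequency table from the question, in ascending order of value
values = np.round(3469.4 + 0.2 * np.arange(33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70, 75, 56,
                   125, 103, 95, 139, 145, 179, 162, 149, 129, 157, 151,
                   78, 78, 79, 69, 49, 8, 13, 12, 18])
data = np.repeat(values, counts)

# Maximum-likelihood estimates of the mean and standard deviation
loc, scale = norm.fit(data)

# 1) frozen distribution, as in the answer above
res_frozen = kstest(data, norm(loc=loc, scale=scale).cdf)

# 2) distribution name plus its parameters via `args`
res_args = kstest(data, 'norm', args=(loc, scale))

# 3) standardize the data, then test against the standard normal
res_z = kstest((data - loc) / scale, 'norm')

print(res_frozen.statistic, res_args.statistic, res_z.statistic)
```

All three compare the empirical CDF with the same fitted normal, so the statistics match up to floating-point rounding.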

edited Oct 11 '17 at 10:50

answered Oct 11 '17 at 03:00

James


Thanks James. It really answered my question. However, what is the n in the lines plt.plot(x, 350*n.pdf(x)) and kstest(data, n.cdf)? If it is the function norm, as I tried, should it be plt.plot(x, 350*norm.pdf(x, loc=loc, scale=scale))? Otherwise, it returns all 0 values. – Gabriel Oct 11 '17 at 04:38
Oh, my mistake! I left off a line; fixed. `n` is the normal distribution given by `loc` and `scale`. – James Oct 11 '17 at 10:49
Great. I see. Thanks a lot. – Gabriel Oct 12 '17 at 01:06
Hi James, actually I am still confused about why you multiplied by 350 in the plot at plt.plot(x, 350*n.pdf(x))? – Gabriel Oct 12 '17 at 08:51
The `pdf` is normalized such that the integral over all of x equals 1. I multiplied by 350 to make its size comparable to the bars. – James Oct 12 '17 at 10:57
Did you determine 350 visually? Is there any way I can determine it computationally? An approximate value should be fine. – Gabriel Oct 13 '17 at 11:20
Yes, it was just determined visually. There are lots of ways to do it computationally. The easiest is to normalize the histogram bars. – James Oct 13 '17 at 11:34
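One way to get the scale factor computationally, as a rough sketch: a histogram of raw counts has total area equal to (number of samples) × (bin width), so multiplying the pdf by that product puts it on the same scale as the bars. Going the other way, normalizing the bars (`density=True`) puts them on the pdf's scale. The names `bin_width`, `edges`, and `factor` are mine, and the bin edges are centered on the recorded values, which is an assumption about how you want to bin.

```python
import numpy as np
from scipy.stats import norm

# Frequency table from the question, in ascending order of value
values = np.round(3469.4 + 0.2 * np.arange(33), 1)
counts = np.array([6, 17, 13, 23, 12, 20, 22, 36, 57, 70, 70, 75, 56,
                   125, 103, 95, 139, 145, 179, 162, 149, 129, 157, 151,
                   78, 78, 79, 69, 49, 8, 13, 12, 18])
data = np.repeat(values, counts)
loc, scale = norm.fit(data)

bin_width = 0.2
# Edges halfway between recorded values, so each value falls inside one bin
edges = np.append(values - bin_width / 2, values[-1] + bin_width / 2)

# Option 1: scale the pdf up to the raw-count histogram
hist_counts, _ = np.histogram(data, bins=edges)
factor = data.size * bin_width  # 2415 * 0.2 = 483
expected_counts = factor * norm.pdf(values, loc=loc, scale=scale)

# Option 2: normalize the bars so they integrate to 1, like the pdf
density, _ = np.histogram(data, bins=edges, density=True)

print(factor)
```

With matplotlib, option 1 corresponds to plt.plot(x, factor * n.pdf(x)) over a count histogram, and option 2 to plt.hist(data, bins=edges, density=True) with the unscaled n.pdf(x).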
