
scipy.stats.kstest(rvs, cdf, N) can perform a KS test on a dataset rvs. It tests whether the dataset follows a probability distribution whose CDF is specified in the parameters of this method.

Consider now a dataset of N=4800 samples. I have performed a KDE on this data and therefore have an estimated PDF. This PDF looks an awful lot like a bimodal distribution. When plotting the estimated PDF and curve-fitting a bimodal distribution to it, the two plots are pretty much identical. The parameters of the fitted bimodal distribution are (scale1, mean1, stdv1, scale2, mean2, stdv2): [0.6, 0.036, 0.52, 0.23, 1.25, 0.4]

How can I apply scipy.stats.kstest to test whether my estimated PDF is bimodally distributed? As my null hypothesis, I state that the estimated PDF equals the following PDF:

hypoDist = 0.6*norm(loc=0, scale=0.2).pdf(x_grid) + 0.3*norm(loc=1, scale=0.2).pdf(x_grid)
hypoCdf = np.cumsum(hypoDist)/len(x_grid)

x_grid is just a vector containing the x-values at which I evaluate my estimated PDF, so each entry of pdf has a corresponding value of x_grid. It might be that my computation of hypoCdf is incorrect. Maybe instead of dividing by len(x_grid), I should divide by np.sum(hypoDist)?
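[Editor's note: for reference, a numerical CDF built from PDF values on a uniform grid is normalized by the grid spacing, not by len(x_grid). A minimal sketch, using a hypothetical grid and weights adjusted to 0.6/0.4 so they sum to 1:]

```python
import numpy as np
from scipy.stats import norm

# Hypothetical uniform grid standing in for x_grid from the question.
x_grid = np.linspace(-2, 3, 1001)
dx = x_grid[1] - x_grid[0]

# Mixture PDF; the weights here are 0.6 and 0.4 so they sum to 1.
hypoDist = (0.6*norm(loc=0, scale=0.2).pdf(x_grid)
            + 0.4*norm(loc=1, scale=0.2).pdf(x_grid))

# Riemann-sum approximation of the CDF: multiply the cumulative sum by
# the grid spacing dx.  Dividing by np.sum(hypoDist) would also force
# the last entry to 1, but only dx gives the correct intermediate values.
hypoCdf = np.cumsum(hypoDist)*dx

print(hypoCdf[-1])  # approximately 1.0
```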

Challenge: the cdf parameter of kstest cannot be specified as bimodal. Neither can I specify it to be hypoDist.

If I wanted to test whether my dataset was Gaussian distributed, I would write:

KS_result = kstest(measurementError, norm(loc=mean(pdf), scale=np.std(pdf)).cdf)
print(KS_result)

measurementError is the dataset that I have performed the KDE on. This returns: statistic=0.459, pvalue=0.0. To me, it is a little irritating that the p-value is 0.0.

newbie
  • You have `scale1` = 0.6 and `scale2` = 0.23 in the text, but those do not add to 1. And in the code that creates `hypoDist`, the scales are apparently 0.6 and 0.3, which also do not add to 1. Your bimodal distribution is a [mixture](https://en.wikipedia.org/wiki/Mixture_distribution) of two normal distributions. For a mixture distribution, the weights (or scales, as you call them) should sum to 1. Could you fix those values, or explain why the values you are using do not sum to 1? – Warren Weckesser May 14 '19 at 07:31
  • @WarrenWeckesser: You're right, that's irritating. The parameters that I am using in the code have been initial guesses from me, based on the plot of the estimated PDF. The parameters `0.6` and `0.23` have been returned by my curve-fit: I have fitted a bimodal distribution to the estimated PDF, and these are the parameters of the fit – newbie May 14 '19 at 08:00
  • That's a problem. If the values do not sum to 1, your combined PDFs do not create a proper probability density function. Your curve fitting procedure should include a constraint that those values sum to 1. Or just use only the parameter `scale1`, and replace `scale2` with `1 - scale1`. – Warren Weckesser May 14 '19 at 08:03
  • ok. I was just using `popt_bimodal, pcov_bimodal = curve_fit(bimodal, x_grid, pdf, p0=[0.6, 0, 0.2, 0.3, 1, 0.2])`. Here, `bimodal` is a function that I have defined. It simply adds two Gaussians. ... ah: you are saying curve_fit fits a curve to my data, but that curve is not a bimodal distribution! – newbie May 14 '19 at 08:08
  • Yes--`curve_fit` does not know that your function is a PDF whose integral over the real line must be 1. See my answer for one way that you can use just one weight. – Warren Weckesser May 14 '19 at 08:25
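[Editor's note: the single-weight fit suggested in the comments can be sketched as follows; the grid and target PDF values here are hypothetical stand-ins for the KDE estimate from the question:]

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def bimodal_pdf(x, w1, mu1, s1, mu2, s2):
    # Mixture of two normals with a single free weight; the second
    # weight is 1 - w1, so the PDF always integrates to 1.
    return (w1*stats.norm.pdf(x, mu1, s1)
            + (1 - w1)*stats.norm.pdf(x, mu2, s2))

# Hypothetical stand-ins for x_grid and the KDE-estimated pdf values.
x_grid = np.linspace(-2, 3, 400)
pdf = bimodal_pdf(x_grid, 0.6, 0.0, 0.5, 1.25, 0.4)

# Fit with only five parameters; the bounds keep w1 inside [0, 1].
popt, pcov = curve_fit(bimodal_pdf, x_grid, pdf,
                       p0=[0.5, 0.1, 0.4, 1.2, 0.3],
                       bounds=([0, -np.inf, 1e-6, -np.inf, 1e-6],
                               [1, np.inf, np.inf, np.inf, np.inf]))
print(popt)  # close to [0.6, 0.0, 0.5, 1.25, 0.4]
```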

1 Answer


The cdf argument to kstest can be a callable that implements the cumulative distribution function of the distribution against which you want to test your data. To use it, you have to implement the CDF of your bimodal distribution. You want the distribution to be a mixture of two normal distributions. You can implement the CDF for this distribution by computing the weighted sum of the CDFs of the two normal distributions that make up the mixture.

Here's a script that shows how you can do this. To demonstrate how kstest is used, the script runs kstest twice. First it uses a sample that is not from the distribution. As expected, kstest computes a very small p-value for this first sample. It then generates a sample that is drawn from the mixture. For this sample, the p-value is not small.

import numpy as np
from scipy import stats


def bimodal_cdf(x, weight1, mean1, stdv1, mean2, stdv2):
    """
    CDF of a mixture of two normal distributions.
    """
    return (weight1*stats.norm.cdf(x, mean1, stdv1) +
            (1 - weight1)*stats.norm.cdf(x, mean2, stdv2))


# We only need weight1, since weight2 = 1 - weight1.
weight1 = 0.6
mean1 = 0.036
stdv1 = 0.52
mean2 = 1.25
stdv2 = 0.4

n = 200

# Create a sample from a regular normal distribution that has parameters
# similar to the bimodal distribution.
sample1 = stats.norm.rvs(0.5*(mean1 + mean2), 0.5, size=n)

# The result of kstest should show that sample1 is not from the bimodal
# distribution (i.e. the p-value should be very small).
stat1, pvalue1 = stats.kstest(sample1, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample1 p-value =", pvalue1)

# Create a sample from the bimodal distribution.  This sample is the
# concatenation of samples from the two normal distributions that make
# up the bimodal distribution.  The number of samples to take from the
# first distribution is determined by a binomial distribution of n
# samples with probability weight1.
n1 = np.random.binomial(n, p=weight1)
sample2 = np.concatenate((stats.norm.rvs(mean1, stdv1, size=n1),
                         (stats.norm.rvs(mean2, stdv2, size=n - n1))))

# Most of the time, the p-value returned by kstest with sample2 will not
# be small.  We expect the value to be uniformly distributed in the interval
# [0, 1], so in general it will not be very small.
stat2, pvalue2 = stats.kstest(sample2, cdf=bimodal_cdf,
                              args=(weight1, mean1, stdv1, mean2, stdv2))
print("sample2 p-value =", pvalue2)

Typical output (the numbers will be different each time the script is run):

sample1 p-value = 2.8395166853884146e-11
sample2 p-value = 0.3289374831186403

You might find that, for your problem, this test does not work well. You have 4800 samples, but in your code you have parameters whose numerical values have just one or two significant digits. Unless you have good reason to believe that your sample is drawn from a distribution with exactly those parameters, it is likely that kstest will return a very small p-value.
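[Editor's note: one hedged way to address this, not part of the original answer, is to estimate the mixture parameters from the sample itself by maximum likelihood before testing. The data below is a hypothetical stand-in for the real measurements. Note that once the parameters are estimated from the same sample, the standard KS p-value is no longer exact (cf. the Lilliefors test), so treat the result as a rough diagnostic.]

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def neg_log_likelihood(params, data):
    # Negative log-likelihood of a two-component normal mixture.
    w1, mu1, s1, mu2, s2 = params
    pdf = (w1*stats.norm.pdf(data, mu1, s1)
           + (1 - w1)*stats.norm.pdf(data, mu2, s2))
    return -np.sum(np.log(pdf))

# Hypothetical sample from a mixture, standing in for the real data.
rng = np.random.default_rng(12345)
n = 4800
n1 = rng.binomial(n, 0.6)
data = np.concatenate((rng.normal(0.036, 0.52, n1),
                       rng.normal(1.25, 0.4, n - n1)))

# Maximize the likelihood; bounds keep the weight and scales valid.
res = minimize(neg_log_likelihood, x0=[0.5, 0.0, 0.5, 1.0, 0.5],
               args=(data,),
               bounds=[(0.01, 0.99), (None, None), (0.01, None),
                       (None, None), (0.01, None)])
w1, mu1, s1, mu2, s2 = res.x
print(res.x)
```

The fitted values can then be passed as args to kstest together with a mixture CDF like the bimodal_cdf above.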

Warren Weckesser
  • "That is one of the ironies of using a test like this with a large sample." Why would that be an irony? isn't it a good thing to be so sensitive? What would be the alternatives? – NoDataDumpNoContribution May 14 '19 at 08:25
  • Maybe "irony" isn't the right word. I'll take that out. – Warren Weckesser May 14 '19 at 08:33
  • @WarrenWeckesser: Can the p-value be exactly 0.000 ? I get this result when testing for a Gaussian distribution, which my data is most probably not drawn from – newbie May 14 '19 at 08:58
  • @newbie, numerically, yes. If you get 0, it means the floating point calculation "underflowed" to 0. The smallest positive value that can be represented with 64 bit floating point is 5e-324. – Warren Weckesser May 14 '19 at 09:01
  • and the meaning of this is: we reject the null hypothesis whenever the returned p-value is smaller than our defined alpha (significance level, usually 5%). My p-value is 0, hence so small that it is very unlikely that my distribution has been drawn from a Gaussian / bimodal (the returned p-value for the bimodal is 0 as well). The p-value states the probability with which our data could still be drawn from the hypothesized distribution, doesn't it? – newbie May 14 '19 at 09:26
  • why do the two weights of the bimodal distribution have to add up to 1 ? When I consider just one Gaussian, its weight does not necessarily need to be 1, it could be less. – newbie May 14 '19 at 12:22
  • @newbie Every valid distribution has integral (or summation) over all possible values equal to 1. This is usually called the normalization constraint or similar terms. In order for a mixture distribution to be normalized, the mixing weights must add up to 1. It's easy to see why: suppose p(x) = a1 p1(x) + a2 p2(x) where a1, a2 are the mixing weights and p1, p2 are the constituent distributions. Then integral(p(x), x) = a1 integral(p1(x), x) + a2 integral(p2(x), x) = a1 + a2 since p1 and p2 are normalized. – Robert Dodier May 16 '19 at 00:53
  • @Trilarion It may not be ironic, but anyway it is known a priori that the null hypothesis is false in this case, since any real distribution cannot be exactly a mixture of Gaussians; therefore, for a large enough sample, one must reject the null hypothesis. This is a well-known feature/bug in significance tests. My advice to OP, or anyone, is to dump the significance testing stuff, and go for a Bayesian decision theoretic approach, which allows for expressing practical significance. – Robert Dodier May 16 '19 at 00:57
  • @RobertDodier "...a Bayesian decision theoretic approach, which allows for expressing practical significance." Just out of curiosity. What resources (textbooks, concepts) would you recommend to read more about that? – NoDataDumpNoContribution May 16 '19 at 07:37
  • @Trilarion Take a look at "Making Hard Decisions" by Robert Clemen. – Robert Dodier May 16 '19 at 21:55