I am trying to perform a KS goodness-of-fit test for my data against an estimated distribution. The plot looks like this: [image]

The code I am using and the results are as follows:

sp.stats.kstest(df['col'], 'norm', args = (mean, sd), N = 1000000)

KstestResult(statistic=0.06905359838747682, pvalue=0.0)

  • df['col'] holds my data points.
  • 'norm' because I assume a normal distribution.
  • args is a tuple with the parameters of the theoretical distribution, which I estimated from my dataset.
  • N = 1000000 is the sample size.

Of course, the fit is not perfect, but I cannot understand why the p-value is exactly 0.0. Am I using the function incorrectly, or is the fit really that bad? I would expect the p-value to be small, even as small as 0.01 or 0.000000536, but not exactly zero.

Any ideas what is wrong or what can be done to make it work?

BTW: the raw data looks log-normally distributed; the plot shows it after a log transformation.

Bonzogondo
  • I have no experience in KS, but IMO this fit is VERY bad, and as such a p value that is so small it goes to 0 doesn't surprise me... – Julien Jul 04 '18 at 23:43
  • The p-value issue aside: In a two-sample KS test the statistic of interest characterises the distance between two *cdfs*; so I'm not sure why you're plotting a *pdf*. – Maurits Evers Jul 05 '18 at 03:48
  • FYI: From the description of the parameter `N` in the [`kstest` docstring](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html): *Sample size if `rvs` is string or callable. Default is 20.* `rvs` is the first argument, which is `df['col']` in your code. You say this is your set of data points, so it is not a string or a callable. In that case, `N` is ignored. – Warren Weckesser Jul 05 '18 at 04:11
  • As Julien noted, the fit is qualitatively pretty bad: there is obvious asymmetry in the histogram. In such a case, the more data you have, the lower the p-value will be, as more points add more "evidence" that the distribution does not match the data. To investigate this, you could try applying the KS test to, say, 1/20 of your data points (selected at random), then 1/10, then 1/5, etc. As you use more points, you'll see the p-value decrease. – Warren Weckesser Jul 05 '18 at 04:22
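
The two points made in the comments (that `N` is ignored when the first argument is an array, and that the p-value shrinks as more points are used) could be checked with a sketch like the one below. It is only an illustration under stated assumptions: the gamma-distributed `data` array is a hypothetical stand-in for `df['col']`, and the subsample sizes are arbitrary. The last line evaluates the rough large-sample approximation p ≈ 2·exp(−2·n·D²) for the statistic reported in the question, assuming n = 1000000 really is the number of data points; if that assumption holds, the p-value is far below the smallest positive double (about 1e-308), which would explain why it prints as 0.0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for df['col']: mildly skewed data, so a normal fit is visibly off.
data = rng.gamma(shape=5.0, scale=1.0, size=200_000)
mean, sd = data.mean(), data.std(ddof=1)

# N only matters when the first argument is a string or callable,
# so passing it alongside an array of data should change nothing.
with_n = stats.kstest(data, 'norm', args=(mean, sd), N=1_000_000)
without_n = stats.kstest(data, 'norm', args=(mean, sd))
same_result = (with_n.statistic == without_n.statistic
               and with_n.pvalue == without_n.pvalue)

# KS test on random subsamples of growing size, with the fitted parameters held fixed.
sizes = [200, 2_000, 20_000, 200_000]
pvalues = []
for n in sizes:
    subsample = rng.choice(data, size=n, replace=False)
    result = stats.kstest(subsample, 'norm', args=(mean, sd))
    pvalues.append(result.pvalue)
    print(n, result.statistic, result.pvalue)

# Leading-term approximation p ~ 2*exp(-2*n*D**2), evaluated in log10 to avoid underflow.
# D is the statistic from the question; n = 1_000_000 is an assumption about the sample size.
log10_p = np.log10(2) - 2 * 1_000_000 * 0.06905359838747682**2 / np.log(10)
print(log10_p)
```

The statistic itself stays in the same range as the subsample grows; it is mainly the sample size that drives the p-value toward zero, so the statistic is the more informative number to report for a sample this large.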
