
I changed the code, passing the Gaussian args as Sam Mason's comment suggests. The results still look wrong to me, since from Q-Q plots I know the data is probably a decent Gaussian. I've posted my updated code below and attached the data file. Perhaps it's obvious, but I don't see how the KS test (or I) gets it so wrong. The .csv data file can be found here: https://ln5.sync.com/dl/658503c20/5fek5x39-y8aqbkfu-tqptym98-nz75wikq

from scipy import stats
import pandas as pd
import numpy as np

alpha = 0.05
df = pd.read_csv("Z079_test_mc.csv")

with open('matrix.txt', 'a') as f:
    for col in df.columns:
        print([col])
        # drop NaNs once, so the fitted mean/std come from the same values the test sees
        values = df[col].dropna().values
        a, b = stats.kstest(values, stats.norm.cdf, args=(np.mean(values), np.std(values)))
        print('Statistics', a, 'p-value', b)
        if b < alpha:
            msg = 'The null hypothesis can be rejected'
        else:
            msg = 'The null hypothesis cannot be rejected'
        print(msg + '\n')
        f.write(str(col) + ',' + 'Kolmogorov Smirnov' + '\n' +
                '        ' + ',' + str(a) + ',' + str(b) + ',' + msg + '\n')
  • Your problem is not reproducible, since you’ve only given one line of data. I don’t use scipy myself, but looking at the documentation it appears the parameters for `norm` are `loc` and `scale` (mean and std deviation), but you’re supplying `min` and `max` of the data. What happens if you drop `args`, which is optional? – pjs Aug 11 '22 at 16:19
  • Hi, dropping args results in the test function only returning "Test statistic=1.0" and "p-value = 0.0", so I started to use args (not correctly as I am aware now). But the new results are also not productive. See my edited OP. – NotAnotherName Aug 12 '22 at 20:04
  • I downloaded your CSV file and loaded it into a professional stats package (JMP). None of your three columns look remotely close to normal with histograms or with distribution fitting options. You should accept the KS results you're getting. – pjs Aug 12 '22 at 20:33
  • Try looking at Q-Q plots. The data points lie pretty close to the respective (Z - mu)/sigma lines. On the other hand, the Anderson-Darling test confirmed a Gaussian. I really don't know why Kolmogorov is so different. Since I have to confirm the Gaussian for work, it's quite a mess for me. Is there a trial version of JMP? If it's for professional use, I'll have to trust that program package and take the result (no Gaussian) for granted. – NotAnotherName Aug 13 '22 at 09:26
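(For reference, the Anderson-Darling comparison mentioned in the comments can be run in SciPy roughly like this; the synthetic sample below is only a stand-in for one of the CSV columns:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=10.0, scale=2.0, size=40)  # stand-in for one CSV column

# Anderson-Darling test for normality: compare the statistic against
# the critical values SciPy returns for several significance levels
res = stats.anderson(sample, dist='norm')
print('A^2 =', res.statistic)
for crit, sig in zip(res.critical_values, res.significance_level):
    verdict = 'reject' if res.statistic > crit else 'cannot reject'
    print(f'{sig}%: critical value {crit} -> {verdict} normality')
```

Unlike `kstest`, `anderson` does not return a p-value; you compare the statistic against the tabulated critical values directly.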

2 Answers


The parameters for a Gaussian distribution in SciPy are the location and scale. In stats speak these are mu and sigma. Hence passing the min and max of the data as args is breaking things.

Probably easiest is just to use args=stats.norm.fit(values), or you could do it manually via args=(np.mean(values), np.std(values)). As a more complete example:

import numpy as np
import scipy.stats as sps

# generate some values from something almost Gaussian
#   1 = Cauchy, +Inf = Gaussian
values = 1e9 + np.random.standard_t(10, size=1000) * 1e9

# perform test
sps.kstest(values, 'norm', sps.norm.fit(values))

or

# parameterize distribution
dist = sps.norm(*sps.norm.fit(values))

# perform test
sps.kstest(values, dist.cdf)
Sam Mason
  • Hey, thanks, I was thinking of the args as clarifying which range the Kolmogorov test has to use, but obviously, as you said, they're for the Gaussian. I think it works better now, but the values suggest my sample data is not Gaussian (test statistic nearly one and p-value zero), which is most definitely the wrong conclusion. I will try to change my post and include everything, also the data. – NotAnotherName Aug 12 '22 at 12:18

I don't know what's going on with Python's KS test aside from your initial use of min/max rather than location/scale as arguments. A quick web review indicated that the Shapiro-Wilk test is preferred over KS for sample sizes < 50, which is what you have.
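A minimal Shapiro-Wilk check in SciPy would look roughly like this (the synthetic sample here is only a stand-in for one of your columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=40)  # stand-in for one CSV column

# Shapiro-Wilk: null hypothesis is that the sample comes from a normal distribution
stat, p = stats.shapiro(sample)
print(f"W = {stat:.4f}, p-value = {p:.4f}")

if p < 0.05:
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
```

Note that, unlike your KS call, `shapiro` needs no distribution parameters: it tests normality with unspecified mean and variance.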

I did a quick analysis in JMP, and have pasted the results below. I suspect your results are inconclusive due to the small sample sizes. My experience with distribution fitting for simulation models is that the results are often ambiguous unless you have sample sizes in the hundreds or even thousands. With sample sizes in the 20s-40s, each histogram bin only has a few observations in it. With that said, normality was not the top choice for any of your three columns of data. I've provided histograms with both the recommended best fit and the best fit normal superimposed, along with QQ plots and associated test statistics for recommended and normal.

Despite inconclusive statistical tests on two of the three columns of data, I stand by what I said in comments -- the histograms do not look normal. The Z79V0001 data is heavy in the tails and has a huge dip near what should be the mode; the Z79V0003_1 data looks multimodal with big gaps; and the Z79V0003_2 data is clearly skewed right (plus it fails the Shapiro-Wilk test at the 0.05 level even with a very small sample size).
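To illustrate the small-sample point, here is a rough simulation sketch (the distribution and parameters are my own assumptions, not your data): even for clearly right-skewed samples of size 30, the KS test with parameters fitted from the same sample rejects normality only a fraction of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials, alpha = 30, 2000, 0.05

rejections = 0
for _ in range(trials):
    # clearly right-skewed data (lognormal), small sample
    x = rng.lognormal(mean=0.0, sigma=0.5, size=n)
    # KS against a normal fitted to the same sample
    stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std()))
    if p < alpha:
        rejections += 1

print(f"rejection rate at n={n}: {rejections / trials:.2f}")
```

With hundreds of observations per sample the rejection rate climbs toward 1, which is the practical argument for collecting more data before drawing conclusions.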

Without further ado, here are screenshots:

Z79V0001 results

Z79V0003_1 results

Z79V0003_2 results

pjs
  • Many thanks for your effort! I come to the following conclusions regarding your new evaluations: I compared your values for Shapiro-Wilk and Anderson-Darling, and Python gives exactly the same numbers. Also, the last data set seems to fail the Shapiro-Wilk test, and Anderson-Darling barely passes. I agree with your remark on the visual inspection of the data; it doesn't look Gaussian to me either. Kolmogorov aside, the conclusion is that if the data doesn't even look Gaussian by inspection, the numerical tests (even if they confirm a normal distribution) are of little value? – NotAnotherName Aug 13 '22 at 19:27
  • I would say to know the limits of the tests rather than that they are of little value. If you have fewer than 100 observations, you won't have more than a few observations outside two sigma for the normal distribution. With 40 observations, a histogram either has an average of 2 to 3 observations per bin, or you end up with only a tiny number of low resolution bins. My advice would be to see if you can get more data (hundreds of observations), because statistical goodness of fit tests are pretty weak without sufficient data. – pjs Aug 13 '22 at 19:36
  • Ok, that makes sense. I will have to examine A LOT more data; there are also datasets with about 80-100+ observations, which should be more straightforward to analyze considering the statistical resolution. I am still troubled that the original paper by Miller (Leslie H. Miller (1956), "Table of Percentage Points of Kolmogorov Statistics", Journal of the American Statistical Association, 51:273, 111-121) lists a table for epsilon (the critical value to check the statistic against) that only covers sample sizes from 1-100, if the test only gets interesting (significant) at 100 and above. – NotAnotherName Aug 13 '22 at 20:16
  • KS is recommended as an option for sample sizes over 50. You can still apply it below that but it becomes quite conservative, meaning that it's unlikely to reject the null hypothesis unless there are some pretty extreme deviations observed. – pjs Aug 13 '22 at 21:06
  • Ah, thanks for the remark. Anyway, I have to discuss these problems tomorrow with my colleague who invented this scheme (namely, assigning certain datasets to distinct Gaussians). After all this discussion, it's doubtful that it will be unequivocal for the rest of the data. I hope he does not go ballistic. But facts are facts, after all. – NotAnotherName Aug 14 '22 at 08:33