I have a data-set of experimental data (y_observed, t whereby y_observed represents measured values and t represents the time in seconds since start of the measurement). I perform a Gaussian fitting to this data, but want to assess whether it makes sense to fit a Gaussian or whether the observed experimental data is non-Gaussian.
I've tried to implement the 2 sample Kolmogorov-Smirnov goodness of fit test in Python. However, I am wondering whether I am implementing and interpreting this test correctly. A small example can be found below, whereby y_observed does not look like a Gaussian (see attached figure) but according to the 2 sample K-S test y_observed does follow a Gaussian distribution. Can someone please explain to me what I'm doing wrong? Thanks in advance!
from scipy.stats import ks_2samp
import numpy as np
y_observed = [
1.93291, 9.84869, 29.5713, 20.6346, -1.51537,
0.396292, 50.8895, 68.855, 63.9291, 55.6863,
37.6503, 46.87, 33.6637, 25.0395,18.3027,
38.0947, 27.9305, 10.2012, -0.704519, 18.6656,
20.2873, 8.78955, 4.39672
]
t = [
519, 522, 525, 528, 531,
534, 537, 540, 543, 546,
549, 552, 555, 558 ,561,
564, 567, 570, 573, 576,
579,582,585
]
y_fit = [
5.60417, 8.74951, 12.9907, 18.3426, 24.63,
31.4518, 38.1947, 44.1102, 48.4454, 50.5991,
50.2586, 47.4739, 42.6458, 36.4314, 29.5973,
22.8668, 16.8011, 11.7394, 7.80065, 4.92939,
2.96233, 1.69298, 0.920122
]
count, bins_count = np.histogram(y_observed, bins=5)
pdf = count / sum(count)
cdf = np.cumsum(pdf)
count_2, bins_count_2 = np.histogram(y_fit, bins=5)
pdf_2 = count_2 / sum(count_2)
cdf_2 = np.cumsum(pdf_2)
KS_stat, P_val = ks_2samp(cdf, cdf_2)
n_1, n_2 = len(y_observed), len(y_fit)
critical_val = np.sqrt((n_1+n_2)/(n_1*n_2))*1.36
if P_val<0.05 and KS_stat>critical_val:
print("Observed data does not follow a Gaussian distribution")
else:
print("Observed data follows a Gaussian distribution")