I'm using the Kolmogorov-Smirnov test in MATLAB to determine the normality of each column of a data matrix prior to performing generalised linear regression. An example data vector is:

data = [8126,3163,9129,5399,8682,1126,1053,7805,2989,2758,3277,1152,6994,6833];

The test runs and gives me a result. However, when I plot the empirical cumulative distribution function (CDF) (blue) against the standard normal CDF (red) for a visual comparison, the scale of the data makes the graph uninformative:

[Figure: exampleCDF, the empirical CDF (blue) plotted against the standard normal CDF (red)]

The code used to plot this figure is:

[h,p,ksstat,cv] = kstest(data);
[f,x_values] = ecdf(data);
figure()
F = plot(x_values,f);
set(F,'LineWidth',2);
hold on
G = plot(x_values,normcdf(x_values,0,1),'r-');
set(G,'LineWidth',2);
legend([F G],...
    'Empirical CDF','Standard Normal CDF',...
    'Location','SE');

Does this mean the result of my test is not valid? If so, can I just normalise the data, e.g.

dataN=(data-min(data))./(max(data)-min(data)); 

while maintaining test validity?

Thank you for your time,

Laura

  • You are plotting the Gaussian CDF with zero mean and standard deviation `1`, so for data values in the thousands the CDF is approximately 1 everywhere. You probably need to use the mean and standard deviation estimated from your data; or normalize the data, and then you can keep the Gaussian CDF with zero mean and unit standard deviation – Luis Mendo Jun 06 '17 at 11:29
  • Of course! Thank you for your advice Luis - changing the mean and standard deviation fixed the problem – Laura Jun 06 '17 at 11:42
  • Anytime! You may want to answer your own question (I'm not sure exactly how you are applying the mean and std dev) and accept the answer so the question doesn't show up as unanswered – Luis Mendo Jun 06 '17 at 11:51

1 Answer

Thanks to Luis Mendo I solved this problem. normcdf takes the mean and standard deviation of the reference normal distribution as inputs, and I had left them at 0 and 1 from the example code I was working from instead of using the values estimated from my data vector. The edited code is:

[h,p,ksstat,cv] = kstest(data);
[f,x_values] = ecdf(data);
figure()
F = plot(x_values,f);
set(F,'LineWidth',2);
hold on
variableMean = mean(data);
variableSD = std(data);
G = plot(x_values,normcdf(x_values,variableMean,variableSD),'r-');
set(G,'LineWidth',2);
legend([F G],...
    'Empirical CDF','Fitted Normal CDF',...
    'Location','SE');
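
On the test-validity part of the question: called with no extra arguments, kstest compares the sample against the standard normal distribution, so it makes sense to rescale the data the same way for the test as for the plot. The sketch below assumes z-score standardisation is acceptable for this data; dataZ is a name introduced here for illustration. Min-max scaling to [0,1], as proposed in the question, would not give zero mean and unit standard deviation, so it would not match the standard normal CDF either.

```matlab
% Example vector from the question
data = [8126,3163,9129,5399,8682,1126,1053,7805,2989,2758,3277,1152,6994,6833];

% Standardise to zero mean and unit standard deviation (z-scores).
% dataZ is a hypothetical variable name, not part of the original code.
dataZ = (data - mean(data)) ./ std(data);

% With no further arguments, kstest compares against the standard normal CDF
[h,p,ksstat,cv] = kstest(dataZ);

% Plot the ECDF of the standardised data against the standard normal CDF
[f,x_values] = ecdf(dataZ);
figure()
F = plot(x_values,f);
set(F,'LineWidth',2);
hold on
G = plot(x_values,normcdf(x_values,0,1),'r-');
set(G,'LineWidth',2);
legend([F G],...
    'Empirical CDF','Standard Normal CDF',...
    'Location','SE');
```

Because the mean and standard deviation are estimated from the same sample being tested, the p-value from kstest is likely to be conservative here; lillietest (the Lilliefors test) is intended for that situation and may be worth checking as an alternative.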
Laura