
I am a newbie in statistics, so I may be missing something obvious here.

Basically, I would like to test whether an array of doubles holding integer counts (a histogram) conforms to a Normal distribution with a specified mean and standard deviation, at some significance level, using the statistical tests from Apache Commons Math.

What I already understand is that the common way is to calculate a p-value and then decide whether or not to reject the null hypothesis.

My first "baby" step is to check whether two arrays come from the same distribution using the One-Way ANOVA test (the second part is taken from the example in the documentation):

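// Imports assume Commons Math 3.x
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.commons.math3.stat.inference.TestUtils;

// Two 100-bin histograms: samples[k - 1] counts how often the value k was drawn
double[] samples1 = new double[100];
double[] samples2 = new double[100];

Random rand = new Random();
for (int i = 0; i < 100000; i++) {
    int index1 = (int) (rand.nextGaussian() * 5 + 50);
    int index2 = (int) (rand.nextGaussian() * 5 + 50);
    // Ignore draws that fall outside the 100 bins
    if (index1 >= 1 && index1 <= samples1.length) {
        samples1[index1 - 1]++;
    }
    if (index2 >= 1 && index2 <= samples2.length) {
        samples2[index2 - 1]++;
    }
}

List<double[]> classes = new ArrayList<>();
classes.add(samples1);
classes.add(samples2);

double pvalue = TestUtils.oneWayAnovaPValue(classes);
// true if the null hypothesis can be rejected at the 0.05 level
boolean fail = TestUtils.oneWayAnovaTest(classes, 0.05);

System.out.println(pvalue);
System.out.println(fail);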

The result is:

1.0
false

Assuming a significance level of 0.05, I conclude that the null hypothesis is not rejected (i.e. both arrays may come from the same distribution), as p > 0.05.
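One thing worth checking before reading much into that p-value: one-way ANOVA compares group means, and here each array holds bin counts, so its mean is simply the total count divided by the number of bins. The following is a minimal sketch in plain Java (no Commons Math; the class and method names are made up for illustration) that builds two histograms with deliberately different shapes and prints the mean bin count of each:

```java
import java.util.Random;

final class HistogramMeans {
    /** Fills a 100-bin histogram with the same bookkeeping as the snippet above. */
    static double[] histogram(Random rand, double mean, double sd, int draws) {
        double[] counts = new double[100];
        for (int i = 0; i < draws; i++) {
            int index = (int) (rand.nextGaussian() * sd + mean);
            // Ignore draws that fall outside the 100 bins
            if (index >= 1 && index <= counts.length) {
                counts[index - 1]++;
            }
        }
        return counts;
    }

    /** Arithmetic mean of the array entries (here: the mean bin count). */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        Random rand = new Random(42);
        double[] a = histogram(rand, 50, 5, 100000); // same parameters as the question
        double[] b = histogram(rand, 30, 2, 100000); // hypothetical, clearly different shape
        System.out.println(mean(a));
        System.out.println(mean(b));
    }
}
```

Both lines print 1000.0 whenever every draw lands inside the 100 bins, because each array then sums to 100000. Equal group means give an F statistic of 0, which would be consistent with a p-value of 1.0 regardless of the histograms' shapes.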

Now let's take the Kolmogorov-Smirnov test. The example code in the documentation shows how to check a single array against a NormalDistribution object (which is my goal). However, it also allows comparing two arrays. I cannot get a proper result in either case. For example, let's adapt the above example to K-S:

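// Imports assume Commons Math 3.x
import java.util.Random;
import org.apache.commons.math3.stat.inference.TestUtils;

// Two 100-bin histograms: samples[k - 1] counts how often the value k was drawn
double[] samples1 = new double[100];
double[] samples2 = new double[100];

Random rand = new Random();
for (int i = 0; i < 100000; i++) {
    int index1 = (int) (rand.nextGaussian() * 5 + 50);
    int index2 = (int) (rand.nextGaussian() * 5 + 50);
    // Ignore draws that fall outside the 100 bins
    if (index1 >= 1 && index1 <= samples1.length) {
        samples1[index1 - 1]++;
    }
    if (index2 >= 1 && index2 <= samples2.length) {
        samples2[index2 - 1]++;
    }
}

// Two-sample K-S test applied directly to the bin counts
double pvalue = TestUtils.kolmogorovSmirnovTest(samples1, samples2);
boolean fail = pvalue < 0.05;

System.out.println(pvalue);
System.out.println(fail);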

The result is:

7.475142727031425E-11
true

My question is: why is the p-value for essentially the same data now so small? Does it mean that this test is not suited to this type of data?

Should I:

  • Generate a reference array from a NormalDistribution (that is, with the specified mean and standard deviation) and then compare it to my array using the One-Way ANOVA test (or another test), or
  • Somehow adapt my data and then use K-S to compare a single array against a NormalDistribution object?
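On the "adapt my data" option: a K-S test builds an empirical distribution from the values in the array, so it expects one entry per observation, whereas samples1 and samples2 hold one entry per bin. A minimal sketch of one possible adaptation, assuming bin i stands for the integer value i + 1 (as implied by the `samples[index - 1]++` bookkeeping); the class and method names are hypothetical:

```java
import java.util.Arrays;

final class HistogramExpand {
    /**
     * Turns bin counts into one observation per counted event.
     * Assumption: bin i represents the integer value i + 1.
     */
    static double[] expand(double[] counts) {
        int total = 0;
        for (double c : counts) {
            total += (int) Math.round(c);
        }
        double[] observations = new double[total];
        int pos = 0;
        for (int i = 0; i < counts.length; i++) {
            int n = (int) Math.round(counts[i]);
            for (int k = 0; k < n; k++) {
                observations[pos++] = i + 1;
            }
        }
        return observations;
    }

    public static void main(String[] args) {
        double[] counts = {2, 0, 3};
        // prints [1.0, 1.0, 3.0, 3.0, 3.0]
        System.out.println(Arrays.toString(expand(counts)));
    }
}
```

The expanded array could then be passed to the K-S methods in place of the raw counts. Note that the expanded values are still integers with many ties, while K-S assumes a continuous distribution, so a small p-value may remain even for data drawn as in the question.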

Grzegorz Szpetkowski
  • I'm not sure what your goal is here, but it seems pointless to carry out such a test. Since your data are only integers, their distribution cannot be a Gaussian distribution. – Robert Dodier Apr 01 '15 at 16:49
  • @Robert Dodier: I need to check if such a dataset is "close enough" to a Gaussian distribution with a specified mean and sd. I thought that the simplest way would be to create a new `NormalDistribution` object (it has a constructor for those two parameters) and then use the K-S method as shown in the documentation. – Grzegorz Szpetkowski Apr 01 '15 at 16:56
  • OK. My advice is to quantify what you mean by "close enough" and use that to assess the data. The answer to that question necessarily depends on the purpose to which you want to put these numbers. What is your goal here? What are you going to do if the data fail the test? – Robert Dodier Apr 01 '15 at 19:38
  • @RobertDodier: Thanks for your help. Please note that I am not a statistics expert; that was just my task "as-is" for some significance level. I looked further into solving this and found the [Chi-squared test](http://en.wikipedia.org/wiki/Chi-squared_test), which seems to be more relevant based on [this discussion](http://mathforum.org/library/drmath/view/72065.html). This method compares integer data with integrals of the Gaussian density function. It is also described as one of the methods available in Apache Commons Math. – Grzegorz Szpetkowski Apr 02 '15 at 16:22
  • It's no problem if you're not a statistics expert. What I'm actually suggesting is that you think about the problem in terms of your problem domain, not "statistically" (i.e., with tests, p-values, etc). To the extent that the data are non-Gaussian, what does it cost you? Why (quantitatively) does it matter that the data are non-Gaussian? You can certainly apply any statistical test you want. But, because the result is divorced from your problem domain, the response to any statement such as "The p-value is significant" must be "So what?" – Robert Dodier Apr 02 '15 at 18:35
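
The chi-squared idea from the comments (comparing observed counts with integrals of the Gaussian density over each bin) can be sketched as follows. This is only an illustration under stated assumptions: bin i is taken to cover the interval [i + 1, i + 2), matching the `(int)` truncation in the question; the normal CDF uses the Abramowitz-Stegun approximation of erf rather than a library call; bins with a small expected count are simply skipped rather than pooled; and all names are hypothetical.

```java
final class NormalBinExpectation {
    /** Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7). */
    static double erf(double x) {
        double sign = x < 0 ? -1 : 1;
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t;
        return sign * (1.0 - poly * Math.exp(-ax * ax));
    }

    /** CDF of a Normal distribution with the given mean and standard deviation. */
    static double normalCdf(double x, double mean, double sd) {
        return 0.5 * (1.0 + erf((x - mean) / (sd * Math.sqrt(2.0))));
    }

    /** Expected count per bin; assumption: bin i covers [i + 1, i + 2). */
    static double[] expectedCounts(int bins, double mean, double sd, double total) {
        double[] expected = new double[bins];
        for (int i = 0; i < bins; i++) {
            expected[i] = total * (normalCdf(i + 2, mean, sd) - normalCdf(i + 1, mean, sd));
        }
        return expected;
    }

    /** Pearson statistic over bins whose expected count is at least minExpected. */
    static double chiSquareStatistic(double[] observed, double[] expected, double minExpected) {
        double stat = 0;
        for (int i = 0; i < observed.length; i++) {
            if (expected[i] >= minExpected) {
                double d = observed[i] - expected[i];
                stat += d * d / expected[i];
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        double[] expected = expectedCounts(100, 50, 5, 100000);
        // Expected count for the bin holding the value 50, i.e. the interval [50, 51)
        System.out.println(expected[49]);
    }
}
```

Turning the statistic into a p-value needs the chi-squared distribution; as far as I know, Commons Math offers `TestUtils.chiSquareTest(double[] expected, long[] observed)` for that, which would take an expected array like the one computed here (restricted to bins with a positive expected count).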
