I am a newbie in statistics, so I guess I might be missing something obvious here.
Basically, I would like to check whether a double array of integer values (a histogram) conforms to a Normal distribution (with a specified mean and standard deviation) at some significance level, based on the statistical tests from Apache Commons Math.
What I understand so far is that the common way is to calculate a p-value and then decide whether or not to reject the null hypothesis.
My first "baby" step is to check whether two arrays come from the same distribution using the One-Way ANOVA test (the second part is taken from the example in the documentation):
    // Two histograms with 100 bins each; bin i counts occurrences of the value i + 1
    double samples1[] = new double[100];
    double samples2[] = new double[100];
    Random rand = new Random();
    for (int i = 0; i < 100000; i++) {
        // Draw from N(50, 5) and truncate to an integer bin index
        int index1 = (int) (rand.nextGaussian()*5 + 50);
        int index2 = (int) (rand.nextGaussian()*5 + 50);
        try {
            samples1[index1-1]++;
        }
        catch (ArrayIndexOutOfBoundsException e) {}
        try {
            samples2[index2-1]++;
        }
        catch (ArrayIndexOutOfBoundsException e) {}
    }
    // Each array of bin counts is passed to ANOVA as one group
    List<double[]> classes = new ArrayList<>();
    classes.add(samples1);
    classes.add(samples2);
    double pvalue = TestUtils.oneWayAnovaPValue(classes);
    boolean fail = TestUtils.oneWayAnovaTest(classes, 0.05);
    System.out.println(pvalue);
    System.out.println(fail);
The result is:
1.0
false
Assuming a significance level of 0.05, I conclude that the null hypothesis cannot be rejected (i.e. both arrays appear to come from the same distribution), since p > 0.05.
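One thing that may be relevant to the 1.0 above: one-way ANOVA compares group means, and each array is passed in as a group of 100 bin counts. The mean of the bin counts is just (number of recorded observations) / (number of bins), whatever the shape of the histogram. The sketch below is plain Java with made-up class and helper names, purely for illustration; it shows two very differently shaped histograms that have identical means.

```java
import java.util.Arrays;

class BinCountMeans {
    // Mean of the bin counts: total observations divided by the number of bins.
    static double mean(double[] counts) {
        return Arrays.stream(counts).sum() / counts.length;
    }

    // Hypothetical helper: every observation falls into the middle bin.
    static double[] peaked(int bins, int total) {
        double[] counts = new double[bins];
        counts[bins / 2] = total;
        return counts;
    }

    // Hypothetical helper: observations spread evenly over all bins.
    static double[] flat(int bins, int total) {
        double[] counts = new double[bins];
        Arrays.fill(counts, (double) total / bins);
        return counts;
    }
}
```

With 100 bins and 100000 observations, `BinCountMeans.mean(...)` gives the same value for both `peaked(100, 100000)` and `flat(100, 100000)`, so a test on means alone would presumably not distinguish the two shapes.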
Now let's take the Kolmogorov-Smirnov test. The example code in the documentation shows how to check a single array against a NormalDistribution object (which is my goal). However, it also allows comparing two arrays. I cannot get a proper result in either case. For example, let's adapt the example above to K-S:
    double samples1[] = new double[100];
    double samples2[] = new double[100];
    Random rand = new Random();
    for (int i = 0; i < 100000; i++) {
        int index1 = (int) (rand.nextGaussian()*5 + 50);
        int index2 = (int) (rand.nextGaussian()*5 + 50);
        try {
            samples1[index1-1]++;
        }
        catch (ArrayIndexOutOfBoundsException e) {}
        try {
            samples2[index2-1]++;
        }
        catch (ArrayIndexOutOfBoundsException e) {}
    }
    double pvalue = TestUtils.kolmogorovSmirnovTest(samples1, samples2);
    boolean fail = pvalue < 0.05;
    System.out.println(pvalue);
    System.out.println(fail);
The result is:
7.475142727031425E-11
true
My question is: why is the p-value for essentially the same data now so small? Does it mean that this test is not suited to this type of data?
Should I:
- Generate a reference array from a NormalDistribution (that is, with the specified mean and standard deviation) and then compare it to my array using the One-Way ANOVA test (or another test)?
- Somehow adapt my data and then use K-S to compare a single array against a NormalDistribution object?
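For the second option, one possible way to "adapt" the data is to turn the histogram back into one value per observation, since K-S works on sample values and not on bin counts. This is only a minimal sketch in plain Java; the class and method names are hypothetical, and it assumes bin i holds the count for the integer value i + 1, as in the code above.

```java
class HistogramSamples {
    // Expands bin counts into one value per observation.
    // Assumption: bin i counts occurrences of the value i + 1,
    // matching the samples[index - 1]++ bookkeeping in the question.
    static double[] expand(double[] counts) {
        int total = 0;
        for (double c : counts) {
            total += (int) c;
        }
        double[] samples = new double[total];
        int pos = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int k = 0; k < (int) counts[i]; k++) {
                samples[pos++] = i + 1;
            }
        }
        return samples;
    }
}
```

The expanded array, not the counts, would then be what gets passed to something like `TestUtils.kolmogorovSmirnovTest(new NormalDistribution(50, 5), samples)`. I believe that overload exists in recent Commons Math 3.x releases, but that should be checked against the documentation. Values truncated to integers also contain many ties, which a test designed for continuous data may not handle well.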