I am comparing two nested linear models in R with anova() and I would like to do the same thing in Java. To simplify things, I took the example code from https://stats.stackexchange.com/questions/48854/why-am-i-getting-different-intercept-values-in-r-and-java-for-simple-linear-regr and modified it a bit below. The models are test_trait ~ geno_A + geno_B and test_trait ~ geno_A + geno_B + geno_A:geno_B. The coefficients of the models fitted in R and Java are the same. In R I use anova(fit, fit2), where the fits are the results of lm(), and in Java I use TestUtils.oneWayAnovaPValue from org.apache.commons.math3. With R I get a p-value of 0.797, while with Java I get a p-value of 0.817, so this is clearly not the right method, but I cannot find how to do it correctly. Is there an equivalent of R's anova.lm in Java?
The full code is below.
R
test_trait <- c(-0.48812477, 0.33458213, -0.52754476, -0.79863471, -0.68544309, -0.12970239, 0.02355622, -0.31890850, 0.34725819, 0.08108851)
geno_A <- c(1, 0, 1, 2, 0, 0, 1, 0, 1, 0)
geno_B <- c(0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
fit <- lm(test_trait ~ geno_A + geno_B)
fit2 <- lm(test_trait ~ geno_A + geno_B + geno_A:geno_B)
which gives the coefficients
> fit
Call:
lm(formula = test_trait ~ geno_A + geno_B)
Coefficients:
(Intercept)    geno_A    geno_B
   -0.03233  -0.10479  -0.60492
> fit2
Call:
lm(formula = test_trait ~ geno_A + geno_B + geno_A:geno_B)
Coefficients:
  (Intercept)         geno_A         geno_B  geno_A:geno_B
    -0.008235      -0.152979      -0.677208       0.096383
And the anova(fit, fit2) comparison:
> anova(fit, fit2) # 0.797
Analysis of Variance Table
Model 1: test_trait ~ geno_A + geno_B
Model 2: test_trait ~ geno_A + geno_B + geno_A:geno_B
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      7 0.77982
2      6 0.77053  1 0.0092897 0.0723  0.797
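As I understand it, what anova(fit, fit2) reports here is a nested-model (partial) F-test built from the two residual sums of squares:

F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2) = ((0.77982 - 0.77053) / 1) / (0.77053 / 6) ≈ 0.0723

and Pr(>F) is the upper tail of an F(1, 6) distribution at that value, i.e. the 0.797 above. This is the calculation I would like to reproduce in Java.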
Java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.stat.inference.TestUtils;
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

// test_trait values
double [] y = {-0.48812477, 0.33458213,
-0.52754476, -0.79863471,
-0.68544309, -0.12970239,
0.02355622, -0.31890850,
0.34725819, 0.08108851};
// design matrix for the first model: columns are geno_A, geno_B
double [][] x = {{1,0}, {0,0},
{1,0}, {2,1},
{0,1}, {0,0},
{1,0}, {0,0},
{1,0}, {0,0}};
// design matrix for the second model: the third column is the interaction geno_A * geno_B
double [][] xb = {{1,0,0}, {0,0,0},
{1,0,0}, {2,1,2},
{0,1,0}, {0,0,0},
{1,0,0}, {0,0,0},
{1,0,0}, {0,0,0}};
OLSMultipleLinearRegression regr = new OLSMultipleLinearRegression();
regr.newSampleData(y, x);
double[] beta = regr.estimateRegressionParameters();
System.out.printf("First model: y = int + genoA + genoB\n");
System.out.printf("Intercept: %.3f\t", beta[0]);
System.out.printf("beta1: %.3f\t", beta[1]);
System.out.printf("beta2: %.3f\n\n", beta[2]);
regr.newSampleData(y, xb);
double[] betab = regr.estimateRegressionParameters();
System.out.printf("Second model: y = int + genoA + genoB + genoA:genoB\n");
System.out.printf("Intercept: %.3f\t", betab[0]);
System.out.printf("beta1: %.3f\t", betab[1]);
System.out.printf("beta2: %.3f\t", betab[2]);
System.out.printf("beta2: %.3f\n", betab[3]);
This gives the same coefficients as in R:
First model: y = int + genoA + genoB
Intercept: -0.032 beta1: -0.105 beta2: -0.605
Second model: y = int + genoA + genoB + genoA:genoB
Intercept: -0.008 beta1: -0.153 beta2: -0.677 beta3: 0.096
But the one-way ANOVA in Java gives a different result:
List<double[]> classes = new ArrayList<>();
classes.add(beta);
classes.add(betab);
double pvalue = TestUtils.oneWayAnovaPValue(classes);
double fvalue = TestUtils.oneWayAnovaFValue(classes);
System.out.println(pvalue);
System.out.println(fvalue);
0.8165390406874127
0.05979444576790511
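Based on the partial F-test formula above, my current idea is to compute it by hand with Commons Math, along the lines of the sketch below. This assumes OLSMultipleLinearRegression.calculateResidualSumOfSquares() and FDistribution behave the way I expect; I have not verified that this is exactly equivalent to anova.lm.

// Sketch: reproduce anova(fit, fit2) as a partial F-test on the two residual sums of squares.
// Needs org.apache.commons.math3.distribution.FDistribution in addition to the imports above.
OLSMultipleLinearRegression reduced = new OLSMultipleLinearRegression();
reduced.newSampleData(y, x);    // test_trait ~ geno_A + geno_B
OLSMultipleLinearRegression full = new OLSMultipleLinearRegression();
full.newSampleData(y, xb);      // test_trait ~ geno_A + geno_B + geno_A:geno_B

double rssReduced = reduced.calculateResidualSumOfSquares(); // should match 0.77982
double rssFull    = full.calculateResidualSumOfSquares();    // should match 0.77053
int dfReduced = y.length - 3;   // n - (2 predictors + intercept) = 7
int dfFull    = y.length - 4;   // n - (3 predictors + intercept) = 6

double f = ((rssReduced - rssFull) / (dfReduced - dfFull)) / (rssFull / dfFull);
FDistribution fDist = new FDistribution(dfReduced - dfFull, dfFull);
double p = 1.0 - fDist.cumulativeProbability(f);

System.out.println(f);  // hoping for ~0.0723
System.out.println(p);  // hoping for ~0.797

Is this manual approach the right way to get what anova(fit, fit2) computes, or is there a built-in equivalent in Commons Math or another Java library?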