2

I am getting different results from lmPerm based on the order in which I enter the variables in the function call.

For example, placing NCF.pf before TotalProperties yields the following:

pfit <- lmp(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)

summary(pfit)
...
Coefficients:
                  Estimate   Iter  Pr(Prob)    
NCF.pf            4.581e-01    51         1    
TotalProperties   5.246e+04  5000    <2e-16 ***

but, when I switch the order of the coefficients in the formula and place TotalProperties before NCF.pf, the p-value on NCF.pf becomes significant

pfit2 <- lmp(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)

summary(pfit2)
...
Coefficients:
                  Estimate   Iter  Pr(Prob)    
TotalProperties   5.246e+04  5000   <2e-16 ***
NCF.pf            4.581e-01  5000   <2e-16 ***

Am I missing something? Why would the p-values be different just because I switched the order of the variables in the function call?

Update - Data Source and lm Output (11/11/2016)

The data can be found on GitHub at this link.

When calling the standard lm function twice (reversing the order of the variables on the second call), the p-values are identical (see below). Hence, unlike when using the lmPerm function, the order of the variables doesn't matter with lm.

fit1 <- lm(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)

summary(fit1)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.088e+05  2.258e+05   3.138   0.0019 ** 
NCF.pf          4.581e-01  1.112e-01   4.121 5.11e-05 ***
TotalProperties 5.246e+04  9.519e+03   5.511 8.76e-08 ***


fit2 <- lm(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)

summary(fit2)
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.088e+05  2.258e+05   3.138   0.0019 ** 
TotalProperties 5.246e+04  9.519e+03   5.511 8.76e-08 ***
NCF.pf          4.581e-01  1.112e-01   4.121 5.11e-05 ***

Thanks!

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
Rymatt830
  • 129
  • 9
  • this is a statistical question, not a programming question: voting to close/migrate to CrossValidated. This is a fundamental property of linear models whenever the predictors are not perfectly orthogonal. – Ben Bolker Nov 11 '16 at 03:15
  • Hi Zheyuan. I have provided a link to the data and the lm output, as you requested. Thank you for your help. – Rymatt830 Nov 11 '16 at 18:57

1 Answers1

3

I already saw 2 close votes to migrate this to Cross Validated, but in my humble opinion this should stay on Stack Overflow. It is true, that t-statistic and p-value are not invariant to the specification order of terms, under the non-pivoted QR factorization strategy used by lm and lmp, but as shown in the new edit, for OP's data, these statistic should be invariant. So there must be something sensitive at programming level.

My quick diagnose suggests, that if we set seqs = TRUE, rather than using the default FALSE, we would get consistent result:

## I have subsetted data with `Presence == 1` into a new dataset `dat`
## I have also renamed variable name for simplicity

coef(summary(lmp(y ~ x1 + x2, dat, seqs = TRUE)))

#                Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000        0
#x1          4.580840e-01 5000        0
#x2          5.245619e+04 5000        0

coef(summary(lmp(y ~ x2 + x1, dat, seqs = TRUE)))
#                Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000        0
#x2          5.245619e+04 5000        0
#x1          4.580840e-01 5000        0

Note, the Pr(Prob) should be "< 2e-16" when printed by summary, but when using coef to obtain a matrix, those tiny values are 0.

The documentation of ?lmp mentions a little on this part:

The SS will be calculated _sequentially_, just as ‘lm()’ does; or
they may be calculated _uniquely_, which means that the SS for
each source is calculated conditionally on all other sources.

I am at the moment not sure what SS is (as I am not a user of lmPerm), but this sounds like that for consistent result, we should set seqs = TRUE.

Zheyuan Li
  • 71,365
  • 17
  • 180
  • 248
  • Hello Zheyuan. Thank you your response, and setting `seqs = TRUE` in the function call produces the same results irrespective of variable order and also produces results consistent with that of lm. SS refers to sum of squares, therefore `seqs = TRUE` calculates sequential sum of squares. In response to @BenBolker, whether or not `seqs` should be set equal to TRUE or FALSE is a question for CrossValidated; but, identifying this as an option in the function call in order to produce consistent results is appropriate for Stack Overflow. – Rymatt830 Nov 11 '16 at 20:11