
Using R 3.2.2, I ran into some odd behavior with a simple linear regression. The first data frame gives the correct result:

test<-data.frame(dt=c(36996616, 36996620, 36996623, 36996626), value=c(1,2,3,4))
lm(value~dt, test)$coefficients

  (Intercept)            dt 
-1.114966e+07  3.013699e-01 

After incrementing the dt variable by one, the slope coefficient becomes NA:

test$dt<-test$dt+1
lm(value~dt, test)$coefficients

(Intercept)          dt 
        2.5          NA 

Any idea why? It seems like there is an overflow somewhere.

Thanks!

duplode
Yves

2 Answers


Edit

I found some better information about this issue.

You can get NA coefficients when the predictors are perfectly correlated. That seems like an unusual explanation here, since there is only one predictor. In this case, though, dt turns out to be (numerically) linearly dependent on the intercept.

We can find linearly dependent variables using alias. See https://stats.stackexchange.com/questions/112442/what-are-aliased-coefficients for background.

In the first example:

test<-data.frame(dt=c(36996616, 36996620, 36996623, 36996626), value=c(1,2,3,4))
fit1 <- lm(value ~ dt, test)
alias(fit1)
Model :
value ~ dt

There are no linearly dependent terms. But in the second example:

test$dt <- test$dt + 1
fit2 <- lm(value ~ dt, test)
alias(fit2)
Model :
value ~ dt

Complete :
   [,1]       
dt 147986489/4

This output appears to show a linear dependence between dt and the intercept.
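A quick way to see this near-dependence (a sketch added here, not part of the original output): the incremented dt values differ from their mean by at most about 5, while the mean itself is about 3.7e7, so the dt column is very nearly a constant multiple of the intercept column of ones. Note also that the 147986489/4 reported by alias is exactly mean(dt).

    # Incremented dt values from the second example
    dt <- c(36996616, 36996620, 36996623, 36996626) + 1
    
    mean(dt)                     # 36996622.25 == 147986489/4, the value alias reports
    max(abs(dt / mean(dt) - 1))  # ~1.4e-7: dt is almost exactly mean(dt) times the intercept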

Additional information on how lm deals with a reduced-rank model: https://stat.ethz.ch/pipermail/r-help/2002-February/018512.html.

lm does not actually invert X'X (https://stat.ethz.ch/pipermail/r-help/2008-January/152456.html), but I still think the code below is helpful for showing the singularity of X'X.

x <- matrix(c(rep(1, 4), test$dt), ncol=2)
y <- test$value

b <- solve(t(x) %*% x) %*% t(x) %*% y
Error in solve.default(t(x) %*% x) : 
system is computationally singular: reciprocal condition number = 7.35654e-30
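A reciprocal condition number of the same tiny order can also be computed directly with base R's rcond (a side check, not in the original answer, using the same x as above):

    x <- matrix(c(rep(1, 4), c(36996616, 36996620, 36996623, 36996626) + 1), ncol = 2)
    
    # rcond() estimates the reciprocal condition number that solve() checks against;
    # here it is vanishingly small, far below solve()'s default tolerance
    rcond(t(x) %*% x)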

The default tol in lm.fit is 1e-7; this is the tolerance used to detect linear dependencies in the QR decomposition.

qr(t(x) %*% x)$rank
[1] 1

If you reduce this tolerance, you get a parameter estimate for dt.

# decrease tol in qr
qr(t(x) %*% x, tol = 1e-31)$rank
[1] 2

# and in lm
lm(value~dt, test, tol=1e-31)$coefficients
  (Intercept)            dt 
-1.114966e+07  3.013699e-01 

See https://stats.stackexchange.com/questions/86001/simple-linear-regression-fit-manually-via-matrix-equations-does-not-match-lm-o for details on the matrix algebra in simple linear regression.
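As an aside (a sketch, not part of the original answer): centering the predictor removes the near-collinearity with the intercept, so lm's default tolerance works without adjustment. The slope is unchanged; only the intercept is reparameterized to the mean of value.

    test <- data.frame(dt = c(36996616, 36996620, 36996623, 36996626) + 1,
                       value = c(1, 2, 3, 4))
    
    # Center dt around its mean; no NA at the default tol
    fit <- lm(value ~ I(dt - mean(dt)), test)
    coef(fit)  # intercept 2.5, slope ~0.3013699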

Whitebeard
  • Yes, the `biglm` function from the `biglm` package gives the same answer. –  Sep 24 '15 at 02:38
  • Thank you very much for the fast answer, and indeed it works well with the `tol` option. The thing is I got the same issue with any set of values for y, ie `test<-data.frame(dt=c(36996616, 36996620, 36996623, 36996626), value=c(4655145,2,-51434,215))`. In that case, x and y are less correlated but same NA issue. – Yves Sep 24 '15 at 04:18
  • You are welcome. The computational singular error I mentioned in the middle is derived from the design matrix `x`, specifically `solve(t(x) %*% x)`. Run `qr(t(x) %*% x)$rank` and you get `1`, which means a singular matrix. But if you run `qr(t(x) %*% x, tol=1e-31)$rank` you get `2`, which is a non-singular matrix. That said, I am able to get both parameter estimates with your updated data, so I think the issue is a combination of the high correlation and the design matrix, which depends only on `dt` – Whitebeard Sep 24 '15 at 04:37
  • @Yves I did some additional research and think I have some better information. Have updated my answer accordingly – Whitebeard Sep 24 '15 at 10:58

The biglm function from the biglm package seems to handle this directly:

library(biglm)
test <- data.frame(dt=c(36996616, 36996620, 36996623, 36996626), 
                   value=c(1,2,3,4))
test$dt <- test$dt+1

coefficients(biglm(value ~ dt, test))
#   (Intercept)            dt 
# -1.114966e+07  3.013699e-01