5

I wanted to fit a simple linear model (lm()) without an intercept coefficient, so I put -1 in my model formula as in the example below. The problem is that the R-squared returned by summary(myModel) seems to be overestimated. lm(), summary() and -1 are among the most classic functions/features in R, so I am a bit surprised, and I wonder whether this is a bug or whether there is a reason for this behaviour.

Here is an example:

x <- rnorm(1000, 3, 1)
mydf <- data.frame(x=x, y=1+x+rnorm(1000, 0, 1))
plot(y ~ x, mydf, xlim=c(-2, 10), ylim=c(-2, 10))

mylm1 <- lm(y ~ x, mydf)
mylm2 <- lm(y ~ x - 1, mydf)

abline(mylm1, col="blue") ; abline(mylm2, col="red")
abline(h=0, lty=2) ; abline(v=0, lty=2)

r2.1 <- 1 - var(residuals(mylm1))/var(mydf$y)
r2.2 <- 1 - var(residuals(mylm2))/var(mydf$y)
r2 <- c(paste0("Intercept - r2: ", format(summary(mylm1)$r.squared, digits=4)),
        paste0("Intercept - manual r2: ", format(r2.1, digits=4)),
        paste0("No intercept - r2: ", format(summary(mylm2)$r.squared, digits=4)),
        paste0("No intercept - manual r2: ", format(r2.2, digits=4)))
legend('bottomright', legend=r2, col=c(4,4,2,2), lty=1, cex=0.6)
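
For reference, printing the two R² values side by side shows the inflation directly (a quick check with the objects above; the exact numbers vary with the random draw):

summary(mylm1)$r.squared  # roughly 0.5 for this simulation
summary(mylm2)$r.squared  # well above 0.9, despite the slightly worse fit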

(figure: scatterplot of y vs x with the blue intercept fit, the red no-intercept fit, and the four R² values in the legend)

SESman
    This is really a statistical understanding question; I flagged it to be migrated to [Cross Validated](http://stats.stackexchange.com/)--ie, stats.SE (*please don't cross-post, however*). In the interim, you will want to read this CV post: [removal of statistically significant intercept term boosts r2 in linear model](http://stats.stackexchange.com/questions/26176/), particularly the excellent answer by @cardinal. – gung - Reinstate Monica Dec 02 '13 at 17:01
    What R is doing in this case is so non-obvious to many that it warrants its own FAQ [entry](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f). You might want to check that out and confirm you are using the same approach to compare. – Gavin Simpson Dec 02 '13 at 17:04
  • @Gavin I am glad I'm not the only one who finds this R behaviour strange, thanks for the link! I found out the hard way myself, but at least I understand *what* happens (unfortunately not *why* :-)), see my answer – Tomas Dec 02 '13 at 17:25
  • Note that not only the R^2 is strange, but also the F-statistic and its p-value. – Tomas Dec 02 '13 at 20:22
  • Thanks everyone for sharing your experience and knowledge. In particular to gung, Gavin Simpson, Dirk Eddelbuettel and Tomas for your comments and answers that saved my time. I am so used to finding answers to my R programming questions on this website that I forgot to check the Cross Validated and R FAQ websites. Anyway, I am definitely clear about this issue now. – SESman Dec 04 '13 at 12:16

2 Answers

11

Oh yeah, I fell into this trap too! Very good question!! It is because

R² = 1 - SSres / SStot

and

SStot = Σᵢ (yᵢ - y̅)²

  • in the case of a model with an intercept (your mylm1), y̅ is mean(yᵢ) - this is what you expect, and it gives the SStot you basically want for a proper R²
  • whereas in the case of a model without an intercept, y̅ is taken as 0 - so SStot will be very large and the R² will be very close to 1! SSres reflects the worse fit (it will be a little higher without the intercept), but not by much.

Code:

y <- mydf$y  # pull the response directly; attach(mylm1) would not expose y

y_fit <- mylm1$fitted.values
SSE <- sum((y_fit - y)^2)
SST <- sum((y - mean(y))^2)
1-SSE/SST  # R^2 with intercept

y_fit2 <- mylm2$fitted.values
SSE2 <- sum((y_fit2 - y)^2) # SSE2 only slightly higher than SSE..
SST2 <- sum((y - 0)^2)  # !!! the key difference is here !!!
1-SSE2/SST2 # R^2 without intercept
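
As a sanity check (using the same objects), both manual values should agree with what summary() reports:

all.equal(1 - SSE/SST, summary(mylm1)$r.squared)    # TRUE
all.equal(1 - SSE2/SST2, summary(mylm2)$r.squared)  # TRUE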

Note: It is not clear to me why, in the model without an intercept, y̅ is taken as 0 and not mean(yᵢ) - but that's how it is. I found this out the hard way, by investigating and hacking with the code above.

Tomas
    Why the model without intercept compares with zero instead of the mean is explained by the last sentence of the [FAQ](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f): "... if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere." – Aaron left Stack Overflow Dec 02 '13 at 20:28
    This is Gavin's link, good to know the official answer. The argumentation is controversial, though: they shouldn't implicitly *assume* E[Y]=0. There are legitimate cases where E[Y] ≠ 0 and you don't want the intercept (e.g. to get "reasonable" coefficients for all categorical covariates, see [example 1 in this answer](http://stats.stackexchange.com/a/32518/5509)). – Tomas Dec 02 '13 at 20:46
  • I suppose so, yes. But that's a situation where the coding doesn't match the intent. That is, even though there's no intercept term in the coding of the formula, you want to compare against a model with an intercept term. – Aaron left Stack Overflow Dec 02 '13 at 21:16
2

Your formula is wrong. Here is what I do in fastLm.R in RcppArmadillo when computing the summary:

## cf src/library/stats/R/lm.R and case with no weights and an intercept
f <- object$fitted.values
r <- object$residuals
## model SS: around mean(f) with an intercept, around 0 without one
mss <- if (object$intercept) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)

r.squared <- mss/(mss + rss)
df.int <- if (object$intercept) 1L else 0L  # the intercept costs one df

n <- length(f)
rdf <- object$df
adj.r.squared <- 1 - (1 - r.squared) * ((n - df.int)/rdf)

There are two places where you need to keep track of whether there is an intercept or not.
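
Applied to the question's fits, here is a minimal sketch of the same logic for plain lm() objects (rsq_lm is a hypothetical helper, not part of RcppArmadillo; since lm fits have no $intercept element, the flag is read from the terms object):

rsq_lm <- function(object) {
    f <- object$fitted.values
    r <- object$residuals
    has_int <- attr(terms(object), "intercept") == 1L  # 1 if the formula has an intercept
    mss <- if (has_int) sum((f - mean(f))^2) else sum(f^2)
    mss / (mss + sum(r^2))
}

rsq_lm(mylm1)  # should match summary(mylm1)$r.squared
rsq_lm(mylm2)  # should match summary(mylm2)$r.squared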

Dirk Eddelbuettel