5

I wanted to fit a simple linear model (lm()) without an intercept coefficient, so I put -1 in my model formula as in the example below. The problem is that the R-squared returned by summary(myModel) seems to be overestimated. lm(), summary() and -1 are among the most classic functions/features in R, so I am a bit surprised, and I wonder whether this is a bug or whether there is a reason for this behaviour.

Here is an example:

x <- rnorm(1000, 3, 1)
mydf <- data.frame(x=x, y=1+x+rnorm(1000, 0, 1))
plot(y ~ x, mydf, xlim=c(-2, 10), ylim=c(-2, 10))

mylm1 <- lm(y ~ x, mydf)
mylm2 <- lm(y ~ x - 1, mydf)

abline(mylm1, col="blue") ; abline(mylm2, col="red")
abline(h=0, lty=2) ; abline(v=0, lty=2)

r2.1 <- 1 - var(residuals(mylm1))/var(mydf$y)
r2.2 <- 1 - var(residuals(mylm2))/var(mydf$y)
r2 <- c(paste0("Intercept - r2: ", format(summary(mylm1)$r.squared, digits=4)),
        paste0("Intercept - manual r2: ", format(r2.1, digits=4)),
        paste0("No intercept - r2: ", format(summary(mylm2)$r.squared, digits=4)),
        paste0("No intercept - manual r2: ", format(r2.2, digits=4)))
legend('bottomright', legend=r2, col=c(4,4,2,2), lty=1, cex=0.6)
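
For reference, printing the two R² values side by side shows the inflation directly (a quick check with the objects above; the exact numbers vary with the random draw):

summary(mylm1)$r.squared  # roughly 0.5 for this simulation
summary(mylm2)$r.squared  # well above 0.9, despite the slightly worse fit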

(figure: scatterplot of y vs x with the blue intercept fit, the red no-intercept fit, and the four R² values in the legend)

SESman
    This is really a statistical understanding question; I flagged it to be migrated to [Cross Validated](http://stats.stackexchange.com/)--ie, stats.SE (*please don't cross-post, however*). In the interim, you will want to read this CV post: [removal of statistically significant intercept term boosts r2 in linear model](http://stats.stackexchange.com/questions/26176/), particularly the excellent answer by @cardinal. – gung - Reinstate Monica Dec 02 '13 at 17:01
    What R is doing in this case is so non-obvious to many that it warrants its own FAQ [entry](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f). You might want to check that out and confirm you are using the same approach to compare. – Gavin Simpson Dec 02 '13 at 17:04
  • @Gavin I am glad I'm not the only one who finds this R behaviour strange, thanks for the link! I found out the hard way myself, but at least I understand *what* happens (unfortunately not *why* :-)), see my answer – Tomas Dec 02 '13 at 17:25
  • Note that not only the R^2 is strange, but also the F-statistic and its p-value. – Tomas Dec 02 '13 at 20:22
  • Thanks everyone for sharing your experience and knowledge. In particular to gung, Gavin Simpson, Dirk Eddelbuettel and Tomas for your comments and answers that saved my time. I am so used to finding answers to my R programming questions on this website that I forgot to check the Cross Validated and R FAQ websites. Anyway, I am definitely clear about this issue now. – SESman Dec 04 '13 at 12:16

2 Answers

11

Oh yeah, I fell into this trap too! Very good question!! It is because

R² = 1 - SSres / SStot

and

SStot = Σᵢ (yᵢ - y̅)²

  • in the case of a model with an intercept (your mylm1), y̅ is mean(yᵢ) - this is what you expect, and it gives the SStot you basically want for a proper R²
  • whereas in the case of a model without an intercept, y̅ is taken as 0 - so SStot will be very large and the R² will be very close to 1! SSres reflects the worse fit (it will be a little higher without the intercept), but not by much.

Code:

y <- mydf$y  # pull the response directly; attach(mylm1) would not expose y

y_fit <- mylm1$fitted.values
SSE <- sum((y_fit - y)^2)
SST <- sum((y - mean(y))^2)
1-SSE/SST  # R^2 with intercept

y_fit2 <- mylm2$fitted.values
SSE2 <- sum((y_fit2 - y)^2) # SSE2 only slightly higher than SSE..
SST2 <- sum((y - 0)^2)  # !!! the key difference is here !!!
1-SSE2/SST2 # R^2 without intercept
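
As a sanity check (using the same objects), both manual values should agree with what summary() reports:

all.equal(1 - SSE/SST, summary(mylm1)$r.squared)    # TRUE
all.equal(1 - SSE2/SST2, summary(mylm2)$r.squared)  # TRUE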

Note: It is not clear to me why, in the model without an intercept, y̅ is taken as 0 and not mean(yᵢ) - but that's how it is. I found this out the hard way, by investigating and hacking with the code above.

Tomas
    Why the model without intercept compares with zero instead of the mean is explained by the last sentence of the [FAQ](http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-does-summary_0028_0029-report-strange-results-for-the-R_005e2-estimate-when-I-fit-a-linear-model-with-no-intercept_003f): "... if you know a priori that E[Y]=0 when x=0 then the ‘null’ model that you should compare to the fitted line, the model where x doesn’t explain any of the variance, is the model where E[Y]=0 everywhere." – Aaron left Stack Overflow Dec 02 '13 at 20:28
    This is Gavin's link, good to know the official answer. The argumentation is controversial, though: they shouldn't implicitly *assume* E[Y]=0. There are legitimate cases where E[Y] ≠ 0 and you don't want the intercept (e.g. to get "reasonable" coefficients for all categorical covariates, see [example 1 in this answer](http://stats.stackexchange.com/a/32518/5509)). – Tomas Dec 02 '13 at 20:46
  • I suppose so, yes. But that's a situation where the coding doesn't match the intent. That is, even though there's no intercept term in the coding of the formula, you want to compare against a model with an intercept term. – Aaron left Stack Overflow Dec 02 '13 at 21:16
2

Your formula is wrong. Here is what I do in fastLm.R in RcppArmadillo when computing the summary:

## cf src/library/stats/R/lm.R and case with no weights and an intercept
f <- object$fitted.values
r <- object$residuals
## model SS: around mean(f) with an intercept, around 0 without one
mss <- if (object$intercept) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)

r.squared <- mss/(mss + rss)
df.int <- if (object$intercept) 1L else 0L  # the intercept costs one df

n <- length(f)
rdf <- object$df
adj.r.squared <- 1 - (1 - r.squared) * ((n - df.int)/rdf)

There are two places where you need to keep track of whether there is an intercept or not.
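
Applied to the question's fits, here is a minimal sketch of the same logic for plain lm() objects (rsq_lm is a hypothetical helper, not part of RcppArmadillo; since lm fits have no $intercept element, the flag is read from the terms object):

rsq_lm <- function(object) {
    f <- object$fitted.values
    r <- object$residuals
    has_int <- attr(terms(object), "intercept") == 1L  # 1 if the formula has an intercept
    mss <- if (has_int) sum((f - mean(f))^2) else sum(f^2)
    mss / (mss + sum(r^2))
}

rsq_lm(mylm1)  # should match summary(mylm1)$r.squared
rsq_lm(mylm2)  # should match summary(mylm2)$r.squared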

Dirk Eddelbuettel