40

Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows:

 v.lm <- lm(epm ~ n_days, data=v)
 print(summary(v.lm))

Results:

Call:
lm(formula = epm ~ n_days, data = v)

Residuals:
    Min      1Q  Median      3Q     Max 
-693.59 -325.79   53.34  302.46  964.95 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2550.39      92.15  27.677   <2e-16 ***
n_days        -13.12       5.39  -2.433   0.0216 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 410.1 on 28 degrees of freedom
Multiple R-squared: 0.1746,     Adjusted R-squared: 0.1451 
F-statistic: 5.921 on 1 and 28 DF,  p-value: 0.0216 
Tomas
fmark

4 Answers

61

The "adjustment" in adjusted R-squared is related to the number of variables and the number of observations.

If you keep adding variables (predictors) to your model, R-squared will never decrease, so the model will appear to explain more of the variance, but some of that improvement may be due to chance alone. Adjusted R-squared tries to correct for this by scaling the unexplained share (1 - R-squared) by the ratio (N-1)/(N-k-1), where N = number of observations and k = number of variables (predictors).

It's probably not a concern in your case, since you have a single predictor.
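
As a rough check, the adjusted value in the question's output can be reproduced from the reported R-squared. This is only a sketch: it assumes the usual form 1 - (1 - R-squared)(N-1)/(N-k-1), and it infers N = 30 from the 28 residual degrees of freedom plus the two estimated coefficients. The variable names are illustrative.

```r
# Values read from the summary() output in the question
R2 <- 0.1746      # Multiple R-squared
k  <- 1           # number of predictors (n_days)
N  <- 28 + k + 1  # residual df + predictors + intercept = 30 (inferred)

# Standard adjustment: scale (1 - R2) by the ratio (N - 1) / (N - k - 1)
adj_R2 <- 1 - (1 - R2) * (N - 1) / (N - k - 1)
print(round(adj_R2, 4))  # 0.1451, matching the reported Adjusted R-squared
```

With one predictor and 30 observations the ratio is 29/28, so the adjustment is small.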

Some references:

  1. How high, R-squared?
  2. Goodness of fit statistics
  3. Multiple regression
  4. Re: What is "Adjusted R^2" in Multiple Regression
riQQ
neilfws
8

Plain R-squared carries no penalty for the number of variables in the model. The adjusted R-squared does.

The adjusted R-squared adds a penalty for adding variables to the model that are uncorrelated with the variable you're trying to explain. You can use it to test if a variable is relevant to the thing you're trying to explain.

Adjusted R-squared is R-squared rescaled by degrees-of-freedom ratios, which is what makes it depend on the number of variables in the model.
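
The penalty can be seen in a small simulation. This is a minimal sketch on made-up data: the data frame `d`, its columns `x`, `y` and `noise`, and the seed are all hypothetical, not taken from the question.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable
n <- 30
d <- data.frame(x = rnorm(n))
d$y     <- 2 + 0.5 * d$x + rnorm(n)  # y depends on x only
d$noise <- rnorm(n)                  # predictor unrelated to y

fit1 <- summary(lm(y ~ x, data = d))
fit2 <- summary(lm(y ~ x + noise, data = d))

# R-squared cannot decrease when a predictor is added to a nested model
print(c(fit1$r.squared, fit2$r.squared))
# Adjusted R-squared may go up or down, depending on how much the
# new predictor reduces the residual sum of squares
print(c(fit1$adj.r.squared, fit2$adj.r.squared))
```

Comparing the two printed pairs shows whether the extra predictor "paid for" the degree of freedom it used.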

Jay
  • Note: Adding a predictor to a regression will almost always increase r-squared, even if only by a little bit due to random sampling. – Jeromy Anglim May 20 '10 at 13:36
  • ty Jeromy, I meant to say "go down" instead of go up. The R-squared will never fall as a result of adding a new variable to the model. The adjusted R-squared can go up or down if a new variable is added. It was a bad example, so I removed it. – Jay May 20 '10 at 17:20
8

The Adjusted R-squared is close to, but different from, the value of R2. Instead of being based on the explained sum of squares SSR and the total sum of squares SSY, it is based on the overall variance (a quantity we do not typically calculate), s2T = SSY/(n - 1) and the error variance MSE (from the ANOVA table) and is worked out like this: adjusted R-squared = (s2T - MSE) / s2T.

This approach provides a better basis for judging the improvement in a fit due to adding an explanatory variable, but it does not have the simple summarizing interpretation that R2 has.
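
For readers wondering how this variance-based definition relates to the (N-1)/(N-k-1) ratio mentioned in an earlier answer, the two appear to be the same quantity. The derivation below is a sketch that assumes k predictors plus an intercept, so that MSE = SSE/(n - k - 1); the symbol k does not appear in this answer and is borrowed for illustration.

```latex
\begin{aligned}
R^2_{\text{adj}}
  &= \frac{s_T^2 - \mathrm{MSE}}{s_T^2}
   = 1 - \frac{\mathrm{SSE}/(n-k-1)}{\mathrm{SSY}/(n-1)} \\
  &= 1 - \frac{\mathrm{SSE}}{\mathrm{SSY}} \cdot \frac{n-1}{n-k-1}
   = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}
\end{aligned}
```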

If I haven't made a mistake, you should be able to verify the values of adjusted R-squared and R-squared as follows:

a <- anova(v.lm)                            # rows: n_days, Residuals
s2T <- sum(a[["Sum Sq"]]) / sum(a[["Df"]])  # total variance: SSY / (n - 1)
MSE <- a[["Mean Sq"]][2]                    # residual mean square
adj.R2 <- (s2T - MSE) / s2T

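On the other hand, R2 is SSR/SSY, where SSR = SSY - SSE (here SSR is the sum of squares due to regression and SSE is the error, or residual, sum of squares):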

SSE <- deviance(v.lm)                   # error SS; or sum(residuals(v.lm)^2)
SSY <- deviance(lm(epm ~ 1, data = v))  # total SS; or sum((v$epm - mean(v$epm))^2)
SSR <- SSY - SSE                        # regression SS; or sum((fitted(v.lm) - mean(v$epm))^2)
R2 <- SSR / SSY
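
To check these hand calculations against what `summary()` reports, something like the following could be used. It is a sketch on simulated data: the data frame `v` built here is a hypothetical stand-in for the question's data, with invented coefficients and noise level.

```r
set.seed(42)
# Hypothetical stand-in for the question's data frame `v`
v <- data.frame(n_days = 1:30)
v$epm <- 2500 - 13 * v$n_days + rnorm(30, sd = 400)
v.lm <- lm(epm ~ n_days, data = v)

a   <- anova(v.lm)
SSY <- sum(a[["Sum Sq"]])    # total sum of squares
SSE <- deviance(v.lm)        # residual (error) sum of squares
R2  <- (SSY - SSE) / SSY
s2T <- SSY / sum(a[["Df"]])  # total variance, SSY / (n - 1)
MSE <- a[["Mean Sq"]][2]     # residual mean square
adj.R2 <- (s2T - MSE) / s2T

# Both comparisons should print TRUE
print(all.equal(R2, summary(v.lm)$r.squared))
print(all.equal(adj.R2, summary(v.lm)$adj.r.squared))
```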
gd047
  • There is a typo in the last code box: The `deviance(v.lm)` call will actually output the model `SSR`, which in turn means that `SSE <- (SSY - SSR)`. As for the `SSY`, a simpler way to retrieve it without having to refit the model would be: `SSY <- sum(anova(v.lm)$"Sum Sq")`. – landroni May 12 '16 at 10:48
  • Actually what I meant is that using `SSR` for explained SS was counterintuitive, and that `SSR` more readily denotes residual SS, whereas `SSE` the explained SS... – landroni May 12 '16 at 11:49
  • SSR is the Sum of Squares due to Regression. Residual Sum of Squares is "RSS" https://en.wikipedia.org/wiki/Explained_sum_of_squares – gd047 May 12 '16 at 13:33
  • Damn those conventions! The book I have at hand (Wooldridge, 2009) uses SSR, SSE, SST for residual, explained, total SS, respectively. I guess when using these ambiguous conventions a note on their intended meaning would be handy... Wiki also defines SSR as "sum of squared residuals": https://en.wikipedia.org/wiki/Residual_sum_of_squares . From what I see RSS, ESS, and TSS are the least confusing notations. – landroni May 12 '16 at 13:59
2

Note that, in addition to the number of predictor variables, the Adjusted R-squared formula above also adjusts for sample size. A small sample will give a deceptively large R-squared.

Ping Yin & Xitao Fan, J. of Experimental Education 69(2): 203-224, "Estimating R-squared shrinkage in multiple regression", compares different methods for adjusting r-squared and concludes that the commonly-used ones quoted above are not good. They recommend the Olkin & Pratt formula.

However, I've seen some indication that sample size has a much larger effect than any of these formulas indicate. I am not convinced that any of these formulas are good enough to allow you to compare regressions done with very different sample sizes (e.g., 2,000 vs. 200,000 observations; the standard formulas would make almost no sample-size-based adjustment). I would do some cross-validation to check the r-squared on each sample.
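
One way to do such a check is k-fold cross-validation. The sketch below is an assumption about what that might look like, not a method from the cited paper: the data frame `d`, the model `y ~ x`, the number of folds, and the seed are all hypothetical.

```r
set.seed(123)
# Hypothetical data; replace with the real data frame and formula
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 0.5 * d$x + rnorm(n)

k_folds <- 5
fold <- sample(rep(1:k_folds, length.out = n))  # random fold assignment
pred <- numeric(n)

for (f in 1:k_folds) {
  test_idx <- fold == f
  fit <- lm(y ~ x, data = d[!test_idx, ])                  # fit on the training folds
  pred[test_idx] <- predict(fit, newdata = d[test_idx, ])  # predict the held-out fold
}

# Out-of-sample R-squared: 1 - (prediction SS) / (total SS)
cv_R2 <- 1 - sum((d$y - pred)^2) / sum((d$y - mean(d$y))^2)
in_sample_R2 <- summary(lm(y ~ x, data = d))$r.squared
print(c(cv = cv_R2, in_sample = in_sample_R2))
```

A cross-validated value well below the in-sample one would suggest the in-sample R-squared is optimistic for that sample size.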

Phil Goetz