
I am trying to understand the results from AIC/BIC in R. For some reason R adds 1 to the number of parameters to be estimated, so instead of the formula 2 * p - 2 * logLik (in the Gaussian case the log-likelihood is a function of the residual sum of squares), R uses 2 * (p + 1) - 2 * logLik.

After some research, I found the problem is related to stats:::logLik.lm().

> stats:::logLik.lm  ## truncated R function body
## ...
##     attr(val, "df") <- p + 1
## ...

As a real example (using R's built-in dataset trees), consider:

x <- lm(Height ~ Girth, trees)  ## a model with 2 parameters
logLik(x)
## 'log Lik.' -96.01663 (df=3)
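To make the bookkeeping concrete, here is a small sketch of how the reported df and R's AIC fit together: the df is the two regression coefficients plus one for the residual standard deviation, and AIC() is computed from that df.

```r
# Sketch: reproduce the df accounting for the trees example.
# df = number of estimated coefficients (2) + 1 for sigma,
# the residual standard deviation, which ML also estimates.
x  <- lm(Height ~ Girth, trees)
ll <- logLik(x)
p  <- length(coef(x))                 # 2: intercept and slope
stopifnot(attr(ll, "df") == p + 1)    # df = 3

# AIC is then 2 * (p + 1) - 2 * logLik, matching AIC(x)
stopifnot(all.equal(AIC(x), 2 * (p + 1) - 2 * as.numeric(ll)))
```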

This is really puzzling. Does anyone know why?


Edit1: glm examples by @cuttlefish44

model.g <- glm(dist ~ speed, cars, family=gaussian)
logLik(model.g) # df=3
model.p <- glm(dist ~ speed, cars, family=poisson)
logLik(model.p) #df=2
model.G <- glm(dist ~ speed, cars, family=Gamma)
logLik(model.G) #df=3

Edit2: methods of logLik

> methods(logLik)
[1] logLik.Arima*  logLik.glm*  logLik.lm*  logLik.logLik* logLik.nls*
Zheyuan Li
Vincent
  • Yes exactly. A linear model. – Vincent Jun 20 '16 at 08:26
  • Right. And it returns: -151.14 (df=4). Which is the right value. The problem is in the dof. There are 4 dof's when clearly there are only 3 parameters to be estimated: one constant and two slope parameters. – Vincent Jun 20 '16 at 08:30
  • I gave it the same name =) – Vincent Jun 20 '16 at 08:31
  • Right thx for the changes. Looks much more readable. – Vincent Jun 20 '16 at 09:03
  • `model.g <- glm(dist ~ speed, cars, family=gaussian)` `model.p <- glm(dist ~ speed, cars, family=poisson)` `model.G <- glm(dist ~ speed, cars, family=Gamma)` `logLik(model.g) #df=3` `logLik(model.p) #df=2` `logLik(model.G) #df=3` I guess +1 is sd (family=gaussian), Dispersion parameter (family=Gamma). – cuttlefish44 Jun 20 '16 at 12:42

1 Answer


We were actually very close to the answer when we decided to inspect stats:::logLik.lm. Had we also inspected stats:::logLik.glm (thanks to the glm examples by @cuttlefish44: mate, you are awesome. Once again you give me inspiration, as in the last post regarding persp() and trans3d(). Thanks!), we would have solved the issue.

The pitfall of using ::: is that we cannot see the comments in the code. So I decided to check the source of R-3.3.0: the file R-3.3.0/src/library/stats/R/logLik.R contains the commented code for the logLik.* methods.

## log-likelihood for glm objects
logLik.glm <- function(object, ...)
{
    if(!missing(...)) warning("extra arguments discarded")
    fam <- family(object)$family
    p <- object$rank
    ## allow for estimated dispersion
    if(fam %in% c("gaussian", "Gamma", "inverse.gaussian")) p <- p + 1
    val <- p - object$aic / 2
    ## Note: zero prior weights have NA working residuals.
    attr(val, "nobs") <- sum(!is.na(object$residuals))
    attr(val, "df") <- p
    class(val) <- "logLik"
    val
}

Note the lines:

p <- object$rank
## allow for estimated dispersion
if(fam %in% c("gaussian", "Gamma", "inverse.gaussian")) p <- p + 1

p is the effective number of model coefficients after rank detection.

  • For "gaussian", "Gamma" and "inverse.gaussian" responses, the degrees of freedom are increased by 1, because the dispersion parameter of the distribution must also be estimated.
  • For "binomial" and "poisson" responses, the dispersion parameter is known to be 1, hence need not be estimated.
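The two bullets can be verified directly. For a Gaussian model the maximised log-likelihood is -n/2 * (log(2*pi) + log(RSS/n) + 1), where RSS/n is the ML estimate of sigma^2; that estimated sigma^2 is the extra parameter behind the + 1. For a Poisson glm there is no such parameter, so df equals the rank. A quick check:

```r
# Gaussian lm: the ML log-likelihood depends on sigma^2 = RSS/n,
# which is itself estimated -- hence the extra degree of freedom.
x   <- lm(Height ~ Girth, trees)
n   <- nobs(x)
rss <- sum(residuals(x)^2)
ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
stopifnot(all.equal(ll, as.numeric(logLik(x))))

# Poisson glm: dispersion is fixed at 1, so df equals the rank ...
m.p <- glm(dist ~ speed, cars, family = poisson)
stopifnot(attr(logLik(m.p), "df") == m.p$rank)        # df = 2

# ... while the Gaussian glm pays one extra df for the dispersion.
m.g <- glm(dist ~ speed, cars, family = gaussian)
stopifnot(attr(logLik(m.g), "df") == m.g$rank + 1)    # df = 3
```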

Maybe ?logLik should consider explaining this, in case others are as foolish as we were!

Zheyuan Li