
I am trying to understand the results from AIC/BIC in R. For some reason R adds 1 to the number of parameters to be estimated, so instead of the formula 2 * p - 2 * logLik (in the Gaussian case the log-likelihood is a function of the residual sum of squares), R uses 2 * (p + 1) - 2 * logLik.

After some research, I found the problem is related to stats:::logLik.lm().

> stats:::logLik.lm  ## truncated R function body
## ...
##     attr(val, "df") <- p + 1
## ...

As a real example (using R's built-in dataset trees), consider:

x <- lm(Height ~ Girth, trees)  ## a model with 2 parameters
logLik(x)
## 'log Lik.' -96.01663 (df=3)
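To make the bookkeeping concrete, here is a small sketch of how the reported df and R's AIC fit together: the df is the two regression coefficients plus one for the residual standard deviation, and AIC() is computed from that df.

```r
# Sketch: reproduce the df accounting for the trees example.
# df = number of estimated coefficients (2) + 1 for sigma,
# the residual standard deviation, which ML also estimates.
x  <- lm(Height ~ Girth, trees)
ll <- logLik(x)
p  <- length(coef(x))                 # 2: intercept and slope
stopifnot(attr(ll, "df") == p + 1)    # df = 3

# AIC is then 2 * (p + 1) - 2 * logLik, matching AIC(x)
stopifnot(all.equal(AIC(x), 2 * (p + 1) - 2 * as.numeric(ll)))
```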

This is really puzzling. Does anyone know why?


Edit1: glm examples by @cuttlefish44

model.g <- glm(dist ~ speed, cars, family=gaussian)
logLik(model.g) # df=3
model.p <- glm(dist ~ speed, cars, family=poisson)
logLik(model.p) #df=2
model.G <- glm(dist ~ speed, cars, family=Gamma)
logLik(model.G) #df=3

Edit2: methods of logLik

> methods(logLik)
[1] logLik.Arima*  logLik.glm*  logLik.lm*  logLik.logLik* logLik.nls*
Zheyuan Li
Vincent
  • Yes exactly. A linear model. – Vincent Jun 20 '16 at 08:26
  • Right. And it returns: -151.14 (df=4). Which is the right value. The problem is in the dof. There are 4 dof's when clearly there are only 3 parameters to be estimated: one constant and two slope parameters. – Vincent Jun 20 '16 at 08:30
  • I gave it the same name =) – Vincent Jun 20 '16 at 08:31
  • Right thx for the changes. Looks much more readable. – Vincent Jun 20 '16 at 09:03
  • `model.g <- glm(dist ~ speed, cars, family=gaussian)` `model.p <- glm(dist ~ speed, cars, family=poisson)` `model.G <- glm(dist ~ speed, cars, family=Gamma)` `logLik(model.g) #df=3` `logLik(model.p) #df=2` `logLik(model.G) #df=3` I guess +1 is sd (family=gaussian), Dispersion parameter (family=Gamma). – cuttlefish44 Jun 20 '16 at 12:42

1 Answer


We were actually very close to the answer when we decided to inspect stats:::logLik.lm. Had we also inspected stats:::logLik.glm (thanks to the glm examples by @cuttlefish44: mate, you are awesome. Once again you give me inspiration, as in the last post regarding persp() and trans3d(). Thanks!), we would have solved the issue.

The pitfall of using ::: is that we cannot see the comments in the code. So I decided to check the source of R-3.3.0: the file R-3.3.0/src/library/stats/R/logLik.R contains the commented code for the logLik.* methods.

## log-likelihood for glm objects
logLik.glm <- function(object, ...)
{
    if(!missing(...)) warning("extra arguments discarded")
    fam <- family(object)$family
    p <- object$rank
    ## allow for estimated dispersion
    if(fam %in% c("gaussian", "Gamma", "inverse.gaussian")) p <- p + 1
    val <- p - object$aic / 2
    ## Note: zero prior weights have NA working residuals.
    attr(val, "nobs") <- sum(!is.na(object$residuals))
    attr(val, "df") <- p
    class(val) <- "logLik"
    val
}

Note the lines:

p <- object$rank
## allow for estimated dispersion
if(fam %in% c("gaussian", "Gamma", "inverse.gaussian")) p <- p + 1

p is the effective number of model coefficients after rank detection.

  • For "gaussian", "Gamma" and "inverse.gaussian" responses, the degrees of freedom are increased by 1, because the dispersion parameter of the distribution must also be estimated.
  • For "binomial" and "poisson" responses, the dispersion parameter is known to be 1, hence need not be estimated.
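The two bullets can be verified directly. For a Gaussian model the maximised log-likelihood is -n/2 * (log(2*pi) + log(RSS/n) + 1), where RSS/n is the ML estimate of sigma^2; that estimated sigma^2 is the extra parameter behind the + 1. For a Poisson glm there is no such parameter, so df equals the rank. A quick check:

```r
# Gaussian lm: the ML log-likelihood depends on sigma^2 = RSS/n,
# which is itself estimated -- hence the extra degree of freedom.
x   <- lm(Height ~ Girth, trees)
n   <- nobs(x)
rss <- sum(residuals(x)^2)
ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
stopifnot(all.equal(ll, as.numeric(logLik(x))))

# Poisson glm: dispersion is fixed at 1, so df equals the rank ...
m.p <- glm(dist ~ speed, cars, family = poisson)
stopifnot(attr(logLik(m.p), "df") == m.p$rank)        # df = 2

# ... while the Gaussian glm pays one extra df for the dispersion.
m.g <- glm(dist ~ speed, cars, family = gaussian)
stopifnot(attr(logLik(m.g), "df") == m.g$rank + 1)    # df = 3
```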

Maybe ?logLik should consider explaining this, in case others are as foolish as we were!

Zheyuan Li