I have googled this and could not find a solution.
It seems R has an issue with AIC/BIC calculation. It produces incorrect results. A simple example is shown below:
link = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
df = read.csv(link, row.names = 'model')
form = 'mpg ~ disp + hp + wt + qsec + gear'
my_model = lm(form, data = df)
summary(my_model)
cat('AIC:',AIC(my_model),'\tBIC:',AIC(my_model, k = log(nrow(df))))
AIC: 157.4512 BIC: 167.7113
Doing Exactly the same thing in python, I obtain:
import pandas as pd
from statsmodels.formula.api import ols as lm
link = 'https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv'
df = pd.read_csv(link, index_col='model')
form = 'mpg ~ disp + hp + wt + qsec + gear'
my_model = lm(form, df).fit()
my_model.summary()
print(f'AIC: {my_model.aic:.4f}\tBIC: {my_model.bic:.4f}')
AIC: 155.4512 BIC: 164.2456
You could check the summary(my_model)
in R and my_model.summary()
in python and you will notice that the two models are EXACTLY the same in everything, apart from the AIC and BIC.
I decided to compute it manually in R:
p = length(coef(my_model)) # number of predictors INCLUDING the Intercept ie 6
s = sqrt(sum(resid(my_model)^2)/nrow(df)) #sqrt(sigma(my_model)^2 * (nrow(df) - p)/nrow(df))
logl = -2* sum(dnorm(df$mpg, fitted(my_model),s, log = TRUE))
c(aic = logl + 2*p, bic = logl + log(nrow(df))*p)
aic bic
155.4512 164.2456
Which matches the results produced by python.
Digging deeper, I noticed that the AIC does use the logLik
function. And that is where the problem arises: logLik(my_model)
gives exactly the same results as shown in the logl
above before multiplying by -2
but the df
is given as 7 instead of 6.
If I bruteforce the rank in order to make it 6, I get the correct results ie:
my_model$rank = my_model$rank - 1
cat('AIC:',AIC(my_model),'\tBIC:',AIC(my_model, k = log(nrow(df))))
AIC: 155.4512 BIC: 164.2456
Why does R add 1 to the number of predictors? You can access the logLik
function used in base R by typing stats:::logLik.lm
on your Rstudio and pressing enter. The two lines below kind of seems to have an issue:
function (object, REML = FALSE, ...)
{
...
p <- object$rank
...
attr(val, "df") <- p + 1 # This line here. Why does R ADD 1?
...
}