
I fitted two regression models, one with a single predictor and another with three predictors, and now I want to compare them. I know how to do this in R but am not sure how to do it in Python. Here is the R code for comparing the two models:

anova(albumSales.2, albumSales.3)

Result:

Model 1: sales ~ adverts
Model 2: sales ~ adverts + airplay + attract
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    198 862264                                  
2    196 434575  2    427690 96.447 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Based on the result above, we can see that albumSales.3 significantly improved the fit of the model to the data compared to albumSales.2, F(2, 196) = 96.44, p < .001.

How can I do it in python?

StupidWolf
bhola prasad
2 Answers


In the ANOVA, you essentially test the difference in RSS between the two models. You can read more in the statsmodels documentation on ANOVA:

import pandas as pd
import seaborn as sns
import numpy as np

iris = sns.load_dataset('iris')

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

We run two models and do the anova:

full_lm = ols("sepal_length ~ petal_length+petal_width", data=iris).fit()
reduced_lm = ols("sepal_length ~ petal_length", data=iris).fit()
anova_lm(reduced_lm,full_lm)

   df_resid        ssr  df_diff  ss_diff       F    Pr(>F)
0     148.0  24.525034      0.0      NaN     NaN       NaN
1     147.0  23.880694      1.0  0.64434  3.9663  0.048272

It throws a warning (you can see it on the statsmodels page linked above) because the F statistic and p-value cannot be calculated for the first row.
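To see what anova_lm is doing, here is a sketch of the same partial F-test computed by hand from the two models' residual sums of squares (reusing the iris fits from above; the formula is the standard nested-model F-test, not a statsmodels function):

```python
import seaborn as sns
from scipy import stats
from statsmodels.formula.api import ols

iris = sns.load_dataset('iris')
reduced_lm = ols("sepal_length ~ petal_length", data=iris).fit()
full_lm = ols("sepal_length ~ petal_length + petal_width", data=iris).fit()

df_diff = reduced_lm.df_resid - full_lm.df_resid  # extra parameters in the full model
ss_diff = reduced_lm.ssr - full_lm.ssr            # drop in residual sum of squares
f_stat = (ss_diff / df_diff) / (full_lm.ssr / full_lm.df_resid)
p_value = stats.f.sf(f_stat, df_diff, full_lm.df_resid)
print(f_stat, p_value)  # matches the F and Pr(>F) in the anova_lm table above
```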

Note that this is different from comparing R-squared values as proposed in the other answer. One important issue is that if you include more terms, your R-squared will always increase, so what you really want to know is whether the extra terms explain significantly more variance, and that is exactly what the ANOVA tests.

StupidWolf

I don't know of a single function that compares two models directly the way anova() does in R, but scikit-learn is a very commonly used Python package for data science and machine learning. It supports various metrics for regression models, which lets you build your own comparison.

For example, it supports the R2 measure. The example below is from scikit-learn's documentation on r2_score:

>>> from sklearn.metrics import r2_score
>>> y_true = [3, -0.5, 2, 7]
>>> y_pred = [2.5, 0.0, 2, 8]
>>> r2_score(y_true, y_pred)
0.948...
>>> y_true = [[0.5, 1], [-1, 1], [7, -6]]
>>> y_pred = [[0, 2], [-1, 2], [8, -5]]
>>> r2_score(y_true, y_pred,
...          multioutput='variance_weighted')
0.938...
>>> y_true = [1, 2, 3]
>>> y_pred = [1, 2, 3]
>>> r2_score(y_true, y_pred)
1.0
>>> y_true = [1, 2, 3]
>>> y_pred = [2, 2, 2]
>>> r2_score(y_true, y_pred)
0.0
>>> y_true = [1, 2, 3]
>>> y_pred = [3, 2, 1]
>>> r2_score(y_true, y_pred)
-3.0

By computing this score for both of your models, you can get a comparison similar to the one you got from R.
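As a sketch of what that could look like, the snippet below fits a one-predictor and a three-predictor model and compares their R2 scores. The data here is synthetic; the three columns merely stand in for the adverts, airplay and attract predictors from the R example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 3))                    # stand-ins for adverts, airplay, attract
y = X @ [0.5, 0.3, 0.2] + rng.normal(size=n)   # synthetic "sales"

model_1 = LinearRegression().fit(X[:, :1], y)  # sales ~ adverts
model_3 = LinearRegression().fit(X, y)         # sales ~ adverts + airplay + attract

print(r2_score(y, model_1.predict(X[:, :1])))
print(r2_score(y, model_3.predict(X)))
```

Keep in mind that on the same training data the three-predictor R2 can never be lower than the one-predictor R2, which is the caveat the other answer raises.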

gallen