I have accidentally noticed that the OLS models implemented by sklearn and statsmodels yield different values of R^2 when the intercept is not fitted. Otherwise they seem to work fine. The following code:

import numpy as np
import sklearn
import statsmodels
import sklearn.linear_model as sl
import statsmodels.api as sm

np.random.seed(42)

N=1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
statsmodelsNoIntercept = sm.OLS(Y, X)

print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)

print(sklearn.__version__, statsmodels.__version__)

prints:

0.78741906105 0.78741906105
-0.950825182861 0.783154483028
0.19.1 0.8.0

Where does the difference come from?

This question differs from Different Linear Regression Coefficients with statsmodels and sklearn, as there sklearn.linear_model.LinearRegression (with intercept) was fitted to an X prepared the same way as for statsmodels.api.OLS.

This question also differs from Statsmodels: Calculate fitted values and R squared, as it addresses the difference between two Python packages (statsmodels and scikit-learn), while the linked question is about statsmodels and the common R^2 definition. Both are resolved by the same answer; however, that issue has already been discussed here: Does the same answer imply that the questions should be closed as duplicates?

abukaj
  • What do you mean? -0.72... is quite different from 0.78... – abukaj Feb 16 '18 at 18:39
  • Always seed random data for reproducibility: `np.random.seed(###)`. – Parfait Feb 16 '18 at 19:28
  • 4
    The absence of an intercept changes the definition of R2 in statsmodels. See https://stackoverflow.com/questions/29664471/why-would-r-squared-decrease-when-i-add-an-exogenous-variable-in-ols-using-pytho/29665662#29665662 and https://stackoverflow.com/questions/24851787/statsmodels-calculate-fitted-values-and-r-squared/24852415#24852415 – Josef Feb 16 '18 at 20:23
  • @Parfait agreed. In this particular case I omited seed, as the sample is quite large and results differ every single run. However for the sake of correctness I have updated the example. – abukaj Feb 18 '18 at 18:16
  • @user333700 May you give that as an answer? I would like to accept it. – abukaj Feb 18 '18 at 18:51
  • Possible duplicate of [Statsmodels: Calculate fitted values and R squared](https://stackoverflow.com/questions/24851787/statsmodels-calculate-fitted-values-and-r-squared) – David Dale Mar 11 '18 at 13:58

1 Answer


As pointed out by @user333700 in the comments, the definition of R^2 used by OLS in statsmodels differs from the one used by scikit-learn.

From the documentation of the RegressionResults class (emphasis mine):

rsquared

R-squared of a model with an intercept. This is defined here as 1 - ssr/centered_tss if the constant is included in the model and 1 - ssr/uncentered_tss if the constant is omitted.
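
So for the no-intercept model in the question, statsmodels compares the residuals against the raw (uncentered) total sum of squares. A minimal sketch, reusing X and Y from the question, that should reproduce rsquared by hand:

import numpy as np
import statsmodels.api as sm

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

res = sm.OLS(Y, X).fit()                       # no constant column in the design matrix
ssr = ((Y - res.fittedvalues) ** 2).sum()      # residual sum of squares
uncentered_tss = (Y ** 2).sum()                # total sum of squares, NOT centered around the mean
print(1 - ssr / uncentered_tss, res.rsquared)  # both values should agree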

From the documentation of LinearRegression.score():

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
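
Note that scikit-learn always centers the total sum of squares around the mean of y, even when fit_intercept=False. A minimal sketch, again reusing X and Y from the question, that should reproduce score() by hand:

import numpy as np
import sklearn.linear_model as sl

np.random.seed(42)
N = 1000
X = np.random.normal(loc=1, size=(N, 1))
Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

model = sl.LinearRegression(fit_intercept=False).fit(X, Y)
u = ((Y - model.predict(X)) ** 2).sum()  # residual sum of squares
v = ((Y - Y.mean()) ** 2).sum()          # total sum of squares, centered around the mean
print(1 - u / v, model.score(X, Y))      # both values should agree

Since the no-intercept model cannot absorb the mean of Y, its residuals can be larger than those of a constant model that always predicts the mean. That is why the score in the question is negative, while the statsmodels rsquared, computed against the uncentered total sum of squares, is not.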

abukaj