
I am so confused. I am comparing lasso and linear regression on a model that predicts housing prices. I don't understand how, when I run a linear model in sklearn, I get a negative R^2, yet when I run lasso I get a reasonable R^2. I know you can get a negative R^2 if linear regression is a poor fit for your data, so I decided to check it using OLS in statsmodels, where I also get a high R^2. I am just confused how this is possible and what is going on. Is it due to multicollinearity?

Also, yes, I know that I can use grid search CV to find alpha for lasso, but my professor wanted us to try it this way in order to get practice coding. I am a math major and this is for a statistics course.
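
For reference, here is a minimal sketch of what the cross-validated approach would look like (it assumes the same X_train and y_train produced by the split below; this is not what I submitted):

import numpy as np
from sklearn.linear_model import LassoCV

# Sketch only: let LassoCV pick alpha over the same log-spaced grid used in the
# manual loop below (assumes X_train, y_train from the train/test split below)
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 200), cv=5, random_state=50)
lasso_cv.fit(X_train, y_train)
print('Best alpha from LassoCV: {:.5f}'.format(lasso_cv.alpha_))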

# Linear regression in sklearn

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=60)

lm = LinearRegression()
lm.fit(X_train, y_train)
predictions_linear = lm.predict(X_test)
print('\nR^2 of linear model is {:.5f}\n'.format(metrics.r2_score(y_test, predictions_linear)))
>>>>R^2 of linear model is -213279628873266528256.00000
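
As a quick sanity check on how badly the unregularized fit blows up, here is a minimal sketch that just inspects the fitted coefficients and the conditioning of the training matrix (assumes lm and X_train from above):

import numpy as np

# Sketch: huge coefficient magnitudes and a huge condition number both point to
# a near-singular design matrix rather than a genuinely terrible model
print('Largest |coefficient|: {:.3e}'.format(np.abs(lm.coef_).max()))
print('Condition number of X_train: {:.3e}'.format(np.linalg.cond(np.asarray(X_train, dtype=float))))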


# Lasso in sklearn

from operator import itemgetter
import numpy as np
from sklearn import linear_model

# Try 200 values of alpha on a log scale and record the test-set R^2 for each
r2_alpha_lasso = []
for num in np.logspace(-4, 1, 200):
    lasso = linear_model.Lasso(alpha=num, random_state=50)
    lasso.fit(X_train, y_train)
    predictions_lasso = lasso.predict(X_test)
    r2_alpha_lasso.append([num, metrics.r2_score(y_test, predictions_lasso)])

# Keep the (alpha, R^2) pair with the highest R^2
r2_maximized_lasso = sorted(r2_alpha_lasso, key=itemgetter(1))[-1]
print("\nR^2 maximized where:\n    Alpha: {:.5f}\n    R^2: {:.5f}\n".format(r2_maximized_lasso[0], r2_maximized_lasso[1]))
>>>>R^2 maximized where:
    Alpha: 0.00120
    R^2: 0.90498
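
To see what the regularization is actually doing at that alpha, here is a minimal sketch that refits lasso at the best alpha found above and counts how many coefficients it sets exactly to zero (assumes X_train, y_train and r2_maximized_lasso from above):

# Sketch: refit at the best alpha from the loop and count zeroed coefficients
lasso_best = linear_model.Lasso(alpha=r2_maximized_lasso[0], random_state=50)
lasso_best.fit(X_train, y_train)
print('Coefficients set to zero: {} of {}'.format((lasso_best.coef_ == 0).sum(), lasso_best.coef_.size))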


# OLS in statsmodels

import statsmodels.api as sm

df['Constant'] = 1  # statsmodels OLS does not add an intercept automatically
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
mod = sm.OLS(endog=y, exog=X)
res = mod.fit()
print(res.summary())  # only printed the relevant results, not the entire table
>>>>R-squared:                       0.921
    Adj. R-squared:                  0.908
    [2] The smallest eigenvalue is 1.26e-29. This might indicate that there are
    strong multicollinearity problems or that the design matrix is singular.
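
To follow up on that warning, here is a minimal sketch of the multicollinearity check I had in mind; it just computes the eigenvalues of X'X directly, which is what the statsmodels note refers to (assumes the X built for the OLS fit above):

import numpy as np

# Sketch: an eigenvalue of X'X near zero means the design matrix is
# (nearly) singular, which is exactly what the statsmodels warning says
X_mat = np.asarray(X, dtype=float)
eigvals = np.linalg.eigvalsh(X_mat.T @ X_mat)
print("Smallest eigenvalue of X'X: {:.3e}".format(eigvals.min()))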

jmoore00
  • For LR you use a single run on the train/test split without any tuning, for Lasso you are showing the maximum R^2 among 200 runs, and for OLS you are using the whole dataset without splitting into train and test. Why do you think these things can be compared? – Vivek Kumar Nov 27 '18 at 09:16
  • I am not comparing them directly, but the discrepancy in R^2 between lasso and the linear model does seem to warrant concern, since both use the same training and test sets. I then looked at the R^2 for OLS to see if I could better understand what was going on. – jmoore00 Nov 27 '18 at 14:12
  • I reiterate what I said above. Even if they use the same train and test data, you are showing the best output from 200 runs of lasso. Are you sure all 200 runs have good scores? Why don't you try 200 runs of LR with different parameters? – Vivek Kumar Nov 27 '18 at 14:30
  • It's 200 runs of lasso, each with a different value of alpha. The chosen model is the one where R^2 is maximized. Of course they won't all have good scores if the regularization strength is high enough to send all the coefficients to zero. The idea is to locate the alpha where R^2 is maximized. Alpha is a hyperparameter selected by the researcher building the model. Linear regression has no such hyperparameter. – jmoore00 Nov 27 '18 at 18:37
  • Exactly my point. LR and Lasso are two different algorithms, with different hyperparameters and different assumptions about the data, so the output may be different. As for sm.OLS, you are sending it different data, the fitting methods are different, and the definition used for R^2 is also different. See this [question](https://stackoverflow.com/q/48832925/3374996) and [this one](https://stats.stackexchange.com/q/249892/133411) – Vivek Kumar Nov 28 '18 at 10:12
  • Of course the output will be different. That is obvious. But the discrepancy in R^2 is still very strange and worth questioning. Also, linear models don't have hyperparameters. – jmoore00 Nov 28 '18 at 16:49
  • Did you look at the links I provided in previous comments? – Vivek Kumar Nov 29 '18 at 06:08

0 Answers