
I have a problem. Is there an option to use early stopping? On a plot I can see that my model starts overfitting after a while, so I want to stop at the most optimal point.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics

dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")

# Encode the boolean superhost flag as 0/1
d = {True: 1, False: 0, np.nan: np.nan}
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression[
                                                             'host_is_superhost'].map(d).astype('int')

X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price']) # Features
y = dfListingsFeature_regression['price'] # Target variable
print(dfListingsFeature_regression.shape)


steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=1000))),
         ('lasso', Lasso(alpha=0.1))]

pipeline = Pipeline(steps) 

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)


parameters = { }

grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)
grid.fit(X_train, y_train)
                    
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))

# Prediction
y_pred = grid.predict(X_test)

print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))

y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))

r2 = metrics.r2_score(y_test, y_pred)
print(r2)

1 Answer

I think what you are looking for is regularization. In this case, we can reduce the chance of overfitting with L1 regularization, i.e. Lasso regression.

This regularization strategy acts as a kind of "feature selection" when you have several features, as it shrinks the coefficients of non-informative features toward zero.
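
As a minimal sketch of that effect (on synthetic data from make_regression, not your listings dataset), you can fit a Lasso and count how many coefficients end up exactly at zero:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
import numpy as np

# Synthetic data: only 5 of the 20 features actually carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=4, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))  # typically well below 20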

In this case, you want to find the alpha value that gives the best score on the test dataset. Additionally, you can plot the gap between the train and test scores to guide your decision.

The larger the alpha value, the stronger the regularization. See the full example below.

Full Example

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import Lasso

import numpy as np
import matplotlib.pyplot as plt

X, y = make_regression(noise=4, random_state=0)

# Alphas to search over
alphas = list(np.linspace(2e-2, 1, 20))

results = {}

for alpha in alphas:
    
    print(f'Fitting Lasso(alpha={alpha})')
    
    estimator = Lasso(alpha=alpha, random_state=0)

    cv_results = cross_validate(
        estimator, X, y, cv=5, return_train_score=True, scoring='neg_root_mean_squared_error'
    )
    
    # Average the (negative) RMSE across folds and flip the sign back to RMSE
    avg_train_score = np.mean(cv_results['train_score']) * -1

    avg_test_score = np.mean(cv_results['test_score']) * -1
    
    results[alpha] = (avg_train_score, avg_test_score)

train_scores = [v[0] for v in results.values()]
test_scores = [v[1] for v in results.values()]
gap_scores = [v[1] - v[0] for v in results.values()]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

ax1.set_title('Alpha values vs Avg score')
ax1.plot(alphas, train_scores, label='Train Score')
ax1.plot(alphas, test_scores, label='Test Score')
ax1.legend()

ax2.set_title('Train/Test Score Gap')
ax2.plot(alphas, gap_scores)

plt.show()

[Plot: average train/test RMSE vs alpha (left) and the train/test score gap (right)]

Notice that when alpha is close to zero the model is overfitting, and when alpha gets bigger it is underfitting. However, around alpha=0.4 we can find a balance between underfitting and overfitting the data.
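
If you want to keep the Pipeline/GridSearchCV setup from your question, you could also search over alpha directly. This is just a sketch, assuming the pipeline, X_train and y_train from your snippet and the step name 'lasso':

import numpy as np
from sklearn.model_selection import GridSearchCV

# Search the Lasso alpha inside the existing pipeline
parameters = {'lasso__alpha': np.linspace(2e-2, 1, 20)}
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)
print("best alpha:", grid.best_params_['lasso__alpha'])

Alternatively, LassoCV from sklearn.linear_model picks alpha by cross-validation along a regularization path internally, which is usually faster than a plain grid search over a single Lasso.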

Miguel Trejo