
To improve my LinearRegression model, I was advised to use standardization, i.e. RobustScaler, for better performance. The shapes of my train and validation sets:

Train set: (4304, 20) (4304,)
Validation set: (1435, 20) (1435,)

So I transform my X for both train and validation sets:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_robust_scaler = scaler.fit_transform(X_train.copy())
X_valid_robust_scaler = scaler.transform(X_valid.copy())

Then I run the model and print scores with the function print_score():

from sklearn import linear_model

regr_vol_2 = linear_model.LinearRegression()
regr_vol_2.fit(X_train_robust_scaler, y_train)

import numpy as np
import pandas as pd

# rmse helper assumed by print_score (its definition is not shown in the original post)
def rmse(pred, actual):
    return np.sqrt(((pred - actual) ** 2).mean())

def print_score(m, X_train, X_valid, y_train: pd.Series, y_valid: pd.Series):
    '''Takes a model, then calculates and prints its RMSE values and r²
    scores for the train and validation sets. Also appends oob_score_ for a
    Random Forest model.
    Parameters:
    -----------
    (1) m --> given model;
    (2) X_train --> training set of independent features;
    (3) X_valid --> validation set of independent features;
    (4) y_train --> training set of the dependent feature;
    (5) y_valid --> validation set of the dependent feature;
    -----------
    Prints scoring values in the following order:
    [training rmse, validation rmse, r² for training set, r² for validation set,
    oob_score_]
    '''
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)


print_score(regr_vol_2, X_train_robust_scaler, X_valid_robust_scaler, y_train, y_valid)

Output [training rmse, validation rmse, r² for training set, r² for validation set]:
before: [260.86301672800016, 271.8005003802866, 0.6184501389479591, 0.5976532655109332]
after: [260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189]

Both give nearly the same result. What did I do wrong? Should I also use RobustScaler() for y_train and y_valid? If I do:

scaler_y = RobustScaler()
y_train_robust_scaler = scaler_y.fit_transform(y_train[:,None])
y_valid_robust_scaler = scaler_y.transform(y_valid[:,None])

I got the same result as without it:

[training rmse, validation rmse, r² for training set, r² for validation set]:
[260.8630167262612, 271.800437195055, 0.6184501389530468, 0.5976534525773189]
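A note on comparing results when the target is scaled: predictions should be mapped back with inverse_transform before computing RMSE on the original scale. A minimal sketch with illustrative synthetic data (the variable names mirror the scaler_y snippet above; regr_y is a hypothetical model fit on the scaled target):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Illustrative data; scaler_y and regr_y mirror the snippet above
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.normal(size=50) * 100

scaler_y = RobustScaler()
y_train_scaled = scaler_y.fit_transform(y_train[:, None]).ravel()

regr_y = LinearRegression().fit(X_train, y_train_scaled)

# Map predictions back to the original target scale before computing RMSE
pred_original = scaler_y.inverse_transform(regr_y.predict(X_train)[:, None]).ravel()
rmse_value = np.sqrt(((pred_original - y_train) ** 2).mean())
print(rmse_value)
```

For OLS with an intercept, fitting on an affinely rescaled target and inverse-transforming the predictions recovers exactly the predictions of fitting on the raw target, which is why the scores come out identical.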

Or should I use RobustScaler() on the whole dataset at once, before the split? If yes, how can I do that, given that I impute NaN values after splitting into train/validation sets?

unkind58

1 Answer


Scaling does not affect an unpenalized regression. It can improve convergence of the solver, but if the model is converging satisfactorily on the raw data, the results will be the same.
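A small self-contained check of this point (synthetic data, illustrative only): fitting LinearRegression on raw features and on robust-scaled features yields the same R², because the scaling is simply absorbed into the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Features deliberately put on very different scales
X = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 0.1, 5])
y = X @ np.array([2.0, -1.0, 0.5, 3.0, 0.0]) + rng.normal(size=200)

raw = LinearRegression().fit(X, y)
X_scaled = RobustScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

# R² is identical up to floating-point noise
print(raw.score(X, y))
print(scaled.score(X_scaled, y))
```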

Ben Reiniger
  • Can you advise what I need to do in order to raise the r-squared metric? – unkind58 Jan 11 '21 at 12:53
    That's too broad a question for the stackexchange format (and off-topic at StackOverflow; see stats.SE or datascience.SE). But a couple of suggestions: try regularized regression (lasso, ridge, or elasticnet; in these cases, definitely scale first; however, your training R2 is not much higher than the validation, so I wouldn't count on much improvement), or try a nonlinear model (either by adding nonlinear terms, or using a natively nonlinear regression model). – Ben Reiniger Jan 11 '21 at 15:03
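A minimal sketch of the first suggestion in that comment (regularized regression with scaling), using scikit-learn's Pipeline so that the imputer and scaler are fit on training data only; the data, imputation strategy, and alpha grid here are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import RidgeCV

# Each step is fit on the data passed to .fit() only; calling .predict()
# or .score() on validation data reuses the fitted imputer and scaler,
# so there is no leakage from validation into preprocessing.
model = make_pipeline(
    SimpleImputer(strategy='median'),
    RobustScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 13)),
)

# Illustrative synthetic data with missing values
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.normal(size=100)

model.fit(X, y)
print(model.score(X, y))
```

This also answers the question about imputing after the split: putting the imputer inside the pipeline means you never have to scale the whole dataset before splitting.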