To improve my LinearRegression model I was advised to use standardization, i.e. RobustScaler, for better performance. The shapes of my train and validation sets:
Train set: (4304, 20) (4304,)
Validation set: (1435, 20) (1435,)
So I transform X for both the train and validation sets:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train_robust_scaler = scaler.fit_transform(X_train.copy())
X_valid_robust_scaler = scaler.transform(X_valid.copy())
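For reference, RobustScaler centers each column on its median and scales by the interquartile range (25th–75th percentile by default), which is why fitting on the train set and only transforming the validation set, as above, is the right pattern. A minimal sketch with toy data (not the post's dataset):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # one large outlier
scaler = RobustScaler().fit(X)

# RobustScaler subtracts the column median and divides by the IQR
# (75th - 25th percentile), so the outlier barely affects the scale.
median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)
print(np.allclose(scaler.transform(X), (X - median) / iqr))
```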
Then I fit the model and print the scores with the print_score() function:
from sklearn import linear_model
regr_vol_2 = linear_model.LinearRegression()
regr_vol_2.fit(X_train_robust_scaler, y_train)
import pandas as pd  # for the type annotations

def print_score(m, X_train: pd.DataFrame, X_valid: pd.DataFrame,
                y_train: pd.Series, y_valid: pd.Series):
    '''Takes a model, then calculates and prints its RMSE values and r²
    scores for the train and validation sets. Also appends oob_score_ for a
    Random Forest model.

    Parameters:
    -----------
    (1) m       --> given model;
    (2) X_train --> training set of independent features;
    (3) X_valid --> validation set of independent features;
    (4) y_train --> training set of the dependent feature;
    (5) y_valid --> validation set of the dependent feature;
    -----------
    Prints scoring values in the following order:
    [training rmse, validation rmse, r² for training set,
     r² for validation set, oob_score_]
    '''
    # rmse is a small helper defined elsewhere in my notebook
    res = [rmse(m.predict(X_train), y_train),
           rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train),
           m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)
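The rmse helper isn't shown in the post; a minimal sketch, assuming it computes the root-mean-squared error:

```python
import numpy as np

def rmse(pred, actual):
    # Root of the mean squared difference between predictions and targets.
    return np.sqrt(((pred - actual) ** 2).mean())

print(rmse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))
```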
print_score(regr_vol_2, X_train_robust_scaler, X_valid_robust_scaler, y_train, y_valid)
| Output | training rmse | validation rmse | r² (train) | r² (valid) |
|---|---|---|---|---|
| before | 260.86301672800016 | 271.8005003802866 | 0.6184501389479591 | 0.5976532655109332 |
| after | 260.8630167262612 | 271.800437195055 | 0.6184501389530468 | 0.5976534525773189 |
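For context, near-identical scores are actually expected here: a plain LinearRegression is invariant to affine rescaling of its features, since the scaling is absorbed into the coefficients and intercept, leaving the fitted predictions and R² unchanged. A quick toy check (random data, not the post's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(RobustScaler().fit_transform(X), y)

# Predictions agree to numerical precision: OLS absorbs any affine
# rescaling of the features into its coefficients and intercept.
print(np.allclose(raw.predict(X),
                  scaled.predict(RobustScaler().fit(X).transform(X))))
```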
Both give almost exactly the same result, so what did I do wrong? Should I also use RobustScaler() for y_train and y_valid? If I do:
scaler_y = RobustScaler()
# RobustScaler expects 2-D input, so reshape the 1-D targets to a column
y_train_robust_scaler = scaler_y.fit_transform(y_train.to_numpy().reshape(-1, 1))
y_valid_robust_scaler = scaler_y.transform(y_valid.to_numpy().reshape(-1, 1))
I get the same result as without it:

| training rmse | validation rmse | r² (train) | r² (valid) |
|---|---|---|---|
| 260.8630167262612 | 271.800437195055 | 0.6184501389530468 | 0.5976534525773189 |
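Note that when the target is scaled, the model's predictions come back in scaled units, so RMSE is only comparable after mapping them back with inverse_transform; and for OLS that round trip again reproduces the unscaled model exactly. A toy check (random data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.2, size=50)

# Fit on a scaled target, then map predictions back to original units.
sy = RobustScaler()
y_s = sy.fit_transform(y[:, None]).ravel()
pred = sy.inverse_transform(
    LinearRegression().fit(X, y_s).predict(X)[:, None]).ravel()

plain = LinearRegression().fit(X, y).predict(X)
print(np.allclose(pred, plain))  # target scaling changes nothing either
```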
Or should I apply RobustScaler() to the whole dataset at once, before the split? If so, how can I do that, given that I impute NaN values after splitting into train/validation sets?
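For the last point, one common leakage-free pattern is to split first and then fit both the imputer and the scaler on the training fold only, e.g. with an sklearn Pipeline. A sketch with hypothetical toy data (not the post's features):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression

# Toy stand-ins for the real X_train / y_train, with a missing value.
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, 5.0], [4.0, 1.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fitted on train only
    ('scale', RobustScaler()),                     # fitted on train only
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
# pipe.predict(X_valid) would then reuse the train-fitted imputer/scaler.
print(pipe.predict(X_train).shape)
```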