I need to select some features from a dataset for a regression task, but the numerical features are on very different scales.

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_boston(return_X_y=True)
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)

To improve the performance of the regression model, do I need to normalize X before applying SelectKBest?

user3104352
    Whether you _need_ to perform normalization is up for debate, but it will often give the same or better results. The only real way to tell is to run the model with and without, and see how well it generalizes to unseen data – G. Anderson Oct 15 '18 at 22:22

1 Answer

The answer is that it depends on your data -- so you should try it to see if it helps! Here's a quick way to transform each variable so that it has a mean of 0 and variance of 1:

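from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

X, y = load_boston(return_X_y=True)

# Standardize every column to mean 0 and variance 1
scaler_x = StandardScaler().fit(X)
X = scaler_x.transform(X)

# Keep the 2 features with the highest univariate F-scores
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)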
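
One thing worth checking for this particular scorer: f_regression derives its F statistic from the Pearson correlation between each feature and the target, and a correlation does not change when a feature is shifted and rescaled. So standardizing should leave the SelectKBest ranking as it is, even if it still helps the model you fit afterwards. Here is a minimal pure-Python sketch of that check. The helper names (pearson_r, f_score, standardize) and the synthetic data are made up for illustration; they are not scikit-learn's API and this is not the Boston dataset.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def f_score(xs, ys):
    """Univariate F statistic from the correlation: r^2 / (1 - r^2) * (n - 2)."""
    r = pearson_r(xs, ys)
    return r * r / (1.0 - r * r) * (len(xs) - 2)


def standardize(xs):
    """Shift to mean 0 and scale to variance 1 (population variance)."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]


random.seed(0)
n = 200

# Two hypothetical features on very different ranges.
small = [random.uniform(0, 1) for _ in range(n)]
large = [random.uniform(0, 10_000) for _ in range(n)]

# Hypothetical target that depends on both features plus noise.
y = [3.0 * s + 0.0002 * l + random.gauss(0, 0.5) for s, l in zip(small, large)]

raw_scores = [f_score(small, y), f_score(large, y)]
scaled_scores = [f_score(standardize(small), y), f_score(standardize(large), y)]

print("raw:   ", raw_scores)
print("scaled:", scaled_scores)
```

The two lists should agree up to floating-point rounding, so the same k features come out on top with or without scaling. Whether scaling helps the regression model itself is a separate question that still has to be tested on held-out data.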
killian95