How to normalize train and test data when they have different numbers of rows

Question

I have a dataframe d, and one of its columns is price (numerical), with 109248 rows. I split the data into two parts, d_train and d_test. d_train has 73196 values and d_test has 36052 values. To normalize d_train['price'] and d_test['price'], I did this:

from sklearn.preprocessing import Normalizer
price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(1, -1))

Now I'm having this issue

ValueError                                Traceback (most recent call last)
<ipython-input-20-ba623ca7bafa> in <module>()
3 X_train_price = price_scalar.fit_transform(X_train['price'].values.reshape(1, -1))
----> 4 X_test_price = price_scalar.transform(X_test['price'].values.reshape(1, -1))
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
394         if n_features != self.n_features_in_:
395             raise ValueError(
397                 f"is expecting {self.n_features_in_} features as input."
398             )
ValueError: X has 36052 features, but Normalizer is expecting 73196 features as input.

Changing reshape(1, -1) to reshape(-1, 1) runs without error, but turns every price value into 1.

What kind of normalization are you trying to achieve? Normalize features or data points? To an absolute range or statistically (e.g., to have unit standard deviation)? – ATony, Dec 09 '21 at 16:28
Why on earth are you reshaping in the first place? You speak of values (73196 and 36052), but the error clearly indicates that these are seen as *features* (natural, after reshaping), hence the expected error. You should not reshape in the code you show here. – desertnaut, Dec 09 '21 at 23:56
normalizer.fit(X_train['price'].values) raises an error: "Expected 2D array, got 1D array instead: array=[105.22 215.96 96.01 ... 368.98 80.53 709.67]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample." This is the reason. – Aakash Verma, Dec 10 '21 at 04:56

score 0 · Answer 1 · answered Dec 09 '21 at 16:32

reshape(-1, 1) is correct. Getting all 1s is what you should expect from sklearn's Normalizer: "Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one."
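A minimal sketch of that behaviour, using three made-up prices in place of the real column (the variable names here are illustrative, not from the question). With a column shape, every row is a one-element vector and gets divided by its own norm; with a row shape, the whole vector is treated as one sample.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative prices only; not the asker's data.
prices = np.array([105.22, 215.96, 96.01])

# Shape (3, 1): three samples with one feature each.
# Each row is divided by its own L2 norm, so every non-zero value becomes 1.
as_column = Normalizer().fit_transform(prices.reshape(-1, 1))

# Shape (1, 3): one sample with three features.
# The whole vector is rescaled so that its L2 norm equals 1.
as_row = Normalizer().fit_transform(prices.reshape(1, -1))

print(as_column.ravel())
print(np.linalg.norm(as_row))
```

Neither result is a per-column rescaling to a fixed range, which is why a different scaler is suggested in the comments.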

Eric Marchand

Then how do I get both between 0 and 1? That's what I'm asking. – Aakash Verma Dec 09 '21 at 16:44
Maybe what you need is something more like StandardScaler or MinMaxScaler. – Eric Marchand Dec 09 '21 at 16:46
I tried those. I have the same issue with them too. – Aakash Verma Dec 09 '21 at 17:05
If you tried, for example, price_scaler = StandardScaler() with no options, you cannot get output rows containing only 1s, because the mean of the train data is 0 after fit_transform. – Eric Marchand Dec 09 '21 at 17:15

score 0 · Answer 2 · answered Dec 09 '21 at 18:55

scikit-learn always assumes that the data is organized with shape (n_points, n_features) (i.e., each row is a data point). Also, from the documentation, Normalizer normalizes "samples individually to unit norm". This means that each data point (i.e., row) is normalized, rather than along the column (i.e., all price values).
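As a rough illustration of the shape convention, the sketch below reproduces the mismatch from the question on a tiny scale. The arrays are stand-ins (5 and 3 values instead of 73196 and 36052), and all variable names are made up for this example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Small stand-ins for the train and test price columns.
train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
test = np.array([15.0, 25.0, 35.0])

row_train = train.reshape(1, -1)  # shape (1, 5): 1 sample, 5 features
col_train = train.reshape(-1, 1)  # shape (5, 1): 5 samples, 1 feature

# Fitting on the row shape records 5 as the expected number of features.
scaler = Normalizer().fit(row_train)

# The test row has 3 "features", so transform rejects it.
try:
    scaler.transform(test.reshape(1, -1))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(row_train.shape, col_train.shape, scaler.n_features_in_)
print(error_message)
```

With the column shape, train and test both have exactly one feature, so differing row counts are no longer a problem.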

To normalize the values to the [0, 1] range, you should use the MinMaxScaler with the data reshaped into a column. That is,

from sklearn.preprocessing import MinMaxScaler
price_scalar = MinMaxScaler()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(-1, 1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(-1, 1))

It is worth noting that this does not guarantee that the test-set prices all fall within the [0, 1] range, because the scaler learns its minimum and maximum from the training set only. That is how it should be when training an ML model, but keep it in mind.
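A small sketch of that caveat, with invented numbers chosen so that the test set goes beyond the training range. Clipping with np.clip at the end is an optional extra assumed here for illustration, not part of the answer above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values: the test set contains prices below and above the train range.
train = np.array([10.0, 20.0, 30.0]).reshape(-1, 1)
test = np.array([5.0, 25.0, 40.0]).reshape(-1, 1)

# The scaler learns min=10 and max=30 from the training data only.
scaler = MinMaxScaler().fit(train)

# Test values are mapped with (x - 10) / (30 - 10), so they can leave [0, 1].
scaled_test = scaler.transform(test)

# Optional: force the scaled test values back into [0, 1].
clipped_test = np.clip(scaled_test, 0.0, 1.0)

print(scaled_test.ravel())
print(clipped_test.ravel())
```

Recent scikit-learn releases also accept MinMaxScaler(clip=True) for the same effect; check that your installed version supports it before relying on it.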

score 0 · Answer 3 · answered Dec 09 '21 at 19:19

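Here you can call fit_transform() directly on each set instead of calling fit() and transform() separately. Normalizer keeps no fitted state, so each call rescales its input on its own; note that the train and test prices are then each divided by their own norm rather than by a shared scale.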

from sklearn.preprocessing import Normalizer
price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.fit_transform(d_test['price'].values.reshape(1, -1))
