
I have a dataframe d, and one of its columns is price (numerical) with 109248 rows. I split the data into two parts, d_train and d_test. d_train has 73196 values and d_test has 36052 values. To normalize d_train['price'] and d_test['price'], I did the following:

from sklearn.preprocessing import Normalizer

price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(1, -1))

Now I'm having this issue

ValueError                                Traceback (most recent call last)
<ipython-input-20-ba623ca7bafa> in <module>()
3 X_train_price = price_scalar.fit_transform(X_train['price'].values.reshape(1, -1))
----> 4 X_test_price = price_scalar.transform(X_test['price'].values.reshape(1, -1))
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
394         if n_features != self.n_features_in_:
395             raise ValueError(
397                 f"is expecting {self.n_features_in_} features as input."
398             )
ValueError: X has 36052 features, but Normalizer is expecting 73196 features as input.

Using reshape(-1, 1) instead of reshape(1, -1) runs without error, but it turns every value of price into 1.

desertnaut
  • What kind of normalization are you trying to achieve? Normalize features or data points? To an absolute range or statistically (e.g., to have unit standard deviation)? – ATony Dec 09 '21 at 16:28
  • I'm just trying to get values between 0 and 1 – Aakash Verma Dec 09 '21 at 16:42
  • Why on earth are you reshaping in the first place? You speak of values (73196 and 36052), but the error clearly indicates that these are seen as *features* (natural, after reshaping), hence the expected error. You should not reshape in the code you show here. – desertnaut Dec 09 '21 at 23:56
  • normalizer.fit(X_train['price'].values) raises an error: "Expected 2D array, got 1D array instead: array=[105.22 215.96 96.01 ... 368.98 80.53 709.67]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample." This is the reason for reshaping. – Aakash Verma Dec 10 '21 at 04:56

3 Answers


reshape(-1, 1) is fine. Getting all 1s is what you should expect from sklearn's Normalizer: each sample (i.e. each row of the data matrix) with at least one non-zero component is rescaled independently of the other samples so that its norm (l1, l2 or inf) equals one. With a single column, each row holds just one value, so dividing it by its own norm gives 1.
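A minimal sketch of that behaviour, assuming a handful of made-up positive prices (the array below is illustrative, not the asker's data). It compares the two reshapes from the question using the default l2 norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Illustrative prices only; any positive values behave the same way.
prices = np.array([105.22, 215.96, 96.01, 368.98])

# Shape (4, 1): four samples with one feature each.
# Each row is divided by its own norm, so every positive value becomes 1.
as_column = Normalizer().fit_transform(prices.reshape(-1, 1))

# Shape (1, 4): one sample with four features.
# The whole row is scaled so that its l2 norm equals 1.
as_row = Normalizer().fit_transform(prices.reshape(1, -1))

print(as_column.ravel())
print(np.linalg.norm(as_row))
```

Neither shape maps the column onto the [0, 1] range in the min-max sense, which is why Normalizer is probably not the right tool for this question.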

Eric Marchand

scikit-learn always assumes that the data is organized with shape (n_points, n_features) (i.e., each row is a data point). Also, from the documentation, Normalizer normalizes "samples individually to unit norm". This means that each data point (i.e., row) is normalized, rather than along the column (i.e., all price values).

To normalize the values to the [0, 1] range, you should use the MinMaxScaler with the data reshaped into a column. That is,

from sklearn.preprocessing import MinMaxScaler
price_scalar = MinMaxScaler()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(-1, 1))
X_test_price = price_scalar.transform(d_test['price'].values.reshape(-1, 1))

It is worth noting that this does not guarantee that the price values in the test set all fall within the [0, 1] range, because the scaler uses the minimum and maximum learned from the training set only. That is how it should be when training an ML model, but keep it in mind.
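A small sketch of that caveat, using made-up train and test values (assumptions for illustration only). The test set deliberately contains prices below the training minimum and above the training maximum:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices: the training range is [10, 30].
train = np.array([10.0, 20.0, 30.0]).reshape(-1, 1)
test = np.array([5.0, 25.0, 40.0]).reshape(-1, 1)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10, max=30 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())  # 5 maps below 0 and 40 maps above 1
```

If out-of-range test values are a problem downstream, recent scikit-learn versions offer MinMaxScaler(clip=True), which clips transformed values to the feature range; whether clipping is appropriate depends on the model.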

ATony

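Here, you can call fit_transform() directly on both sets instead of calling fit() and transform() separately. Because Normalizer is stateless, this avoids the feature-count error, but note that each array is scaled to unit norm rather than to the [0, 1] range.

from sklearn.preprocessing import Normalizer

price_scalar = Normalizer()
X_train_price = price_scalar.fit_transform(d_train['price'].values.reshape(1, -1))
X_test_price = price_scalar.fit_transform(d_test['price'].values.reshape(1, -1))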
Hemang Dhanani