
In earlier versions of sklearn's MinMaxScaler, one could specify the minimum and maximum values against which the scaler would normalize the data. In other words, the following was possible:

from sklearn import preprocessing
import numpy as np
x_data = np.array([[66,74,89], [1,44,53], [85,86,33], [30,23,80]])
scaler = preprocessing.MinMaxScaler()
scaler.fit([-90, 90])
b = scaler.transform(x_data)

This would scale the array above to the range (0, 1), with the minimum possible value of -90 becoming 0, the maximum possible value of 90 becoming 1, and all values in between scaled linearly (for example, 66 would map to (66 + 90) / 180 ≈ 0.867). With version 0.21 of sklearn this throws an error:

ValueError: Expected 2D array, got 1D array instead:
array=[-90.  90.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

I changed scaler.fit([-90, 90]) to scaler.fit([[-90, 90]]), but then I got:

ValueError: operands could not be broadcast together with shapes (4,3) (2,) (4,3)
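Inspecting the fitted attributes makes the shape clash visible. A quick check, assuming the same sklearn 0.21 setup:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit([[-90, 90]])     # interpreted as 1 sample with 2 features
print(scaler.data_min_)     # [-90.  90.] -- one entry per "feature"
print(scaler.scale_.shape)  # (2,) -- cannot broadcast against the (4, 3) data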

I know for a fact that I can do scaler.fit(x_data), but this leads to the following result after transform:

[[0.77380952 0.80952381 1.        ]
 [0.         0.33333333 0.35714286]
 [1.         1.         0.        ]
 [0.3452381  0.         0.83928571]]

My issue with that is twofold:

1) The numbers do not seem to be correct. They were supposed to be scaled between 0 and 1, but I get many 0s and many 1s for values that should be higher and lower, respectively.

2) What if I want to scale every future array to the range (0, 1) based on a fixed range of, say, (-90, 90)? This was a convenient feature, but now I have to use a specific array to do my scaling. What is more, the scaling will produce different results every time, because I will have to fit every future array anew, thus receiving variable results, as the sketch below illustrates.
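To make the second point concrete, here is a minimal sketch (with values chosen purely for illustration) of how refitting on every batch scales the same value differently:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

batch1 = np.array([[66], [1], [85]])
batch2 = np.array([[66], [30], [90]])

# Refitting on each batch learns different bounds, so the value 66
# is scaled differently every time:
print(MinMaxScaler().fit_transform(batch1)[0])  # [0.77380952]  i.e. (66 - 1) / (85 - 1)
print(MinMaxScaler().fit_transform(batch2)[0])  # [0.6]         i.e. (66 - 30) / (90 - 30)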

Am I missing something here? Is there a way to keep this nifty feature? And if there isn't, how can I make sure my data is scaled correctly and consistently every time?

J. Fenigan
1 Answer


It seems that the problem is not the scikit-learn package version but the shape of the input data passed to the fit() method of the MinMaxScaler object:

import numpy as np
import sklearn
from sklearn.preprocessing import MinMaxScaler

print('scikit-learn package version: {}'.format(sklearn.__version__))
# scikit-learn package version: 0.21.3

scaler = MinMaxScaler()
x_sample = [-90, 90]
scaler.fit(np.array(x_sample)[:, np.newaxis]) # reshape data to satisfy fit() method requirements
x_data = np.array([[66,74,89], [1,44,53], [85,86,33], [30,23,80]])

print(scaler.transform(x_data))

# [[0.86666667 0.91111111 0.99444444]
# [0.50555556 0.74444444 0.79444444]
# [0.97222222 0.97777778 0.68333333]
# [0.66666667 0.62777778 0.94444444]]
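Because the scaler was fitted once on the fixed (-90, 90) range, it can be reused on any future array and will always apply the same mapping. A quick check, continuing the snippet above with made-up new data:

new_data = np.array([[0, 45, -45]])  # hypothetical future sample
print(scaler.transform(new_data))
# [[0.5  0.75 0.25]]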

To learn about the input data requirements of popular preprocessors such as StandardScaler and MinMaxScaler, you can see my answer to another question about StandardScaler.fit() input.

Eduard Ilyasov
  • Thank you, it works perfectly! I also figured out that all MinMaxScaler effectively does is scale all values as follows: x_data = (x_data + abs(min)) / (2*abs(min)), where min is -90 – J. Fenigan Oct 28 '19 at 00:23
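A note on the formula in that comment: the shortcut works because the chosen range is symmetric around zero. The general MinMaxScaler mapping is (x - min) / (max - min), which reduces to (x + abs(min)) / (2*abs(min)) only when max == -min. A quick numeric check:

import numpy as np

x = np.array([66., 1., 85., 30.])
lo, hi = -90, 90

general = (x - lo) / (hi - lo)            # general min-max formula
shortcut = (x + abs(lo)) / (2 * abs(lo))  # the comment's shortcut, valid since hi == -lo
print(np.allclose(general, shortcut))     # True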