saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis]);

Why is newaxis being used here? I know what newaxis does, but I can't figure out its use in this particular situation.

Seanny123

1 Answer


df_train['SalePrice'] is a pandas Series (a vector / 1D array) of shape (N,), where N is the number of elements.

Since version 0.17, scikit-learn has deprecated passing 1D arrays (vectors) as data; its methods expect 2D arrays of shape (n_samples, n_features).

df_train['SalePrice'][:,np.newaxis]

transforms the 1D array (shape: (N,)) into a 2D array (shape: (N, 1), i.e. N rows and 1 column).
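To make the shape change concrete, here is a minimal sketch on a plain NumPy array. The `prices` values are made up for illustration and are not taken from the question's data. It also shows that `np.newaxis` is just an alias for `None` and that `reshape(-1, 1)` gives the same result. With a pandas Series, one would presumably convert to a NumPy array first (for example with `.to_numpy()`), since recent pandas versions may no longer accept this kind of indexing on a Series directly.

```python
import numpy as np

# Made-up values, for illustration only (not from the question's dataset)
prices = np.array([208500.0, 181500.0, 223500.0])

# Three equivalent ways to turn a 1D vector into a single-column 2D array
col_newaxis = prices[:, np.newaxis]
col_none = prices[:, None]          # np.newaxis is an alias for None
col_reshape = prices.reshape(-1, 1)

print(prices.shape)       # shape of the 1D vector
print(col_newaxis.shape)  # shape of the column (N rows, 1 column)
print(np.newaxis is None)
```

All three forms describe the same data; which one to use is mostly a matter of readability.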

Demo:

In [21]: df = pd.DataFrame(np.random.randint(10, size=(5, 3)), columns=list('abc'))

In [22]: df
Out[22]:
   a  b  c
0  4  3  8
1  7  5  6
2  1  3  9
3  7  5  7
4  7  0  6

In [23]: from sklearn.preprocessing import StandardScaler

In [24]: df['a'].shape
Out[24]: (5,)      # <--- 1D array

In [25]: df['a'][:, np.newaxis].shape
Out[25]: (5, 1)    # <--- 2D array

There is also a pandas way to do the same:

In [26]: df[['a']].shape
Out[26]: (5, 1)    # <--- 2D array

In [27]: StandardScaler().fit_transform(df[['a']])
Out[27]:
array([[-0.5 ],
       [ 0.75],
       [-1.75],
       [ 0.75],
       [ 0.75]])
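As a sanity check, the scaled values above can be reproduced by hand with NumPy alone. This is only a sketch of the standardization formula (subtract the column mean, divide by the column's population standard deviation, i.e. `ddof=0`), not scikit-learn's actual implementation. The input is column `a` from the demo DataFrame.

```python
import numpy as np

# Column 'a' from the demo DataFrame, as a single-column 2D array
X = np.array([4, 7, 1, 7, 7], dtype=float).reshape(-1, 1)

# Per-column statistics: axis=0 reduces over rows (samples)
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

scaled = (X - mean) / std
print(scaled.ravel())  # agrees with Out[27] up to floating-point rounding
```

Working along `axis=0` is what makes the 2D layout matter: each column is treated as one feature and each row as one sample.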

What happens if we pass a 1D array:

In [28]: StandardScaler().fit_transform(df['a'])
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\utils\validation.py:429: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:586: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(msg, DeprecationWarning)
C:\Users\Max\Anaconda4\lib\site-packages\sklearn\preprocessing\data.py:649: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  warnings.warn(DEPRECATION_MSG_1D, DeprecationWarning)
Out[28]: array([-0.5 ,  0.75, -1.75,  0.75,  0.75])
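The deprecation warning above says the 1D case will raise ValueError in 0.19. The sketch below assumes a scikit-learn version of 0.19 or later is installed; under that assumption the 1D call should fail, while the reshaped 2D call works. The exact error message is not reproduced here because it may differ between versions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 'a' from the demo, as a plain 1D array
x = np.array([4.0, 7.0, 1.0, 7.0, 7.0])

# Assumption: scikit-learn >= 0.19, where 1D input is rejected
try:
    StandardScaler().fit_transform(x)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", str(exc).splitlines()[0])

# Reshaping to one column (a single feature) is the fix the warning suggests
scaled = StandardScaler().fit_transform(x.reshape(-1, 1))
print(scaled.ravel())
```

The same `reshape(-1, 1)` call is what the warning text itself recommends for data with a single feature.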
MaxU - stand with Ukraine