I am new to using the MinMaxScaler, so please do not bite my head off if this is a very simple question. I have the following dataset:

sample_df.head(2)

ID     S_LENGTH     S_WIDTH     P_LENGTH     P_WIDTH     SPECIES
-------------------------------------------------------------------
1      3.5          2.5          5.6         1.7        VIRGINICA
2      4.5          5.6          3.4         8.7         SETOSA

How do I apply normalisation to all the columns of this dataset, excluding the ID and SPECIES columns, using the code below?

I basically want to use preprocessing.MinMaxScaler() to apply normalisation, so that all the features are in the range 0 to 1.

This is the code I am using...

min_max = preprocessing.MinMaxScaler()
min_max.fit_transform(sample_df)

...but when I execute it, I get this error:

ValueError: could not convert string to float: 'SETOSA'

Any help on how to accomplish what I want to do is much appreciated!

Also, my sincere apologies if this is a really dumb question, but I am new to this.

Thank you!

EDIT (SHOWING ERROR):

Alternatively, if I do this...

min_max = preprocessing.MinMaxScaler()
min_max.fit_transform(sample_df[['S_LENGTH', 'S_WIDTH']])

sample_df.head(2)

...I get this error:

AttributeError: 'numpy.ndarray' object has no attribute 'sample'

1 Answer

I doubt this will be very helpful, but you can get the numeric columns with:

num_df = df[[i for i in df.columns if df[i].dtypes != 'O']]

num_df
Out[126]: 
   ID  S_LENGTH  S_WIDTH  P_LENGTH  P_WIDTH
0   1       3.5      2.5       5.6      1.7
1   2       4.5      5.6       3.4      8.7

and then apply the MinMaxScaler on it:

min_max = preprocessing.MinMaxScaler()
min_max.fit_transform(num_df)

Out[129]:
array([[0., 0., 0., 1., 0.],
       [1., 1., 1., 0., 1.]])
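
The same numeric-column selection can also be written with pandas' `select_dtypes`, which avoids comparing dtypes by hand. This is only a minimal sketch: the frame below is rebuilt from the two sample rows in the question, and the variable names are illustrative.

```python
import pandas as pd
from sklearn import preprocessing

# Two sample rows copied from the question.
df = pd.DataFrame({
    "ID": [1, 2],
    "S_LENGTH": [3.5, 4.5],
    "S_WIDTH": [2.5, 5.6],
    "P_LENGTH": [5.6, 3.4],
    "P_WIDTH": [1.7, 8.7],
    "SPECIES": ["VIRGINICA", "SETOSA"],
})

# Keep only the numeric columns (this still includes ID).
num_df = df.select_dtypes(include="number")

min_max = preprocessing.MinMaxScaler()
# fit_transform returns a NumPy array, not a DataFrame.
scaled = min_max.fit_transform(num_df)
print(type(scaled), scaled.shape)
```

Note that ID is numeric, so it is scaled along with the features here, just as in the list-comprehension version.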

EDIT: Using your df:

df
Out[162]: 
   ID  S_LENGTH  S_WIDTH  P_LENGTH  P_WIDTH    SPECIES
0   1       3.5      2.5       5.6      1.7  VIRGINICA
1   2       4.5      5.6       3.4      8.7     SETOSA

Use the following code:

num_cols = [i for i in df.columns if df[i].dtypes != 'O']
cat_cols = [i for i in df.columns if df[i].dtypes == 'O']
# fit_transform returns a NumPy array, so wrap its result (not its input) in a DataFrame
num_df = pd.DataFrame(min_max.fit_transform(df[num_cols]), columns=num_cols, index=df.index)
cat_df = df[cat_cols]
res = pd.merge(num_df, cat_df, left_index=True, right_index=True)

which will give you:

print(res)

    ID  S_LENGTH  S_WIDTH  P_LENGTH  P_WIDTH    SPECIES
0  0.0       0.0      0.0       1.0      0.0  VIRGINICA
1  1.0       1.0      1.0       0.0      1.0     SETOSA

Try the code line by line and let me know if this is what you need.
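
If ID should stay unscaled as well, as the question asks, one option is to assign the scaled values back to just the feature columns. The following is a sketch under assumptions, not the only way to do it: the frame is rebuilt from the question's two sample rows, and `feature_cols` and `scaled_df` are names chosen for illustration.

```python
import pandas as pd
from sklearn import preprocessing

# Two sample rows copied from the question.
sample_df = pd.DataFrame({
    "ID": [1, 2],
    "S_LENGTH": [3.5, 4.5],
    "S_WIDTH": [2.5, 5.6],
    "P_LENGTH": [5.6, 3.4],
    "P_WIDTH": [1.7, 8.7],
    "SPECIES": ["VIRGINICA", "SETOSA"],
})

# Everything except the identifier and the label is treated as a feature.
feature_cols = [c for c in sample_df.columns if c not in ("ID", "SPECIES")]

min_max = preprocessing.MinMaxScaler()

# Work on a copy so the original values stay available.
scaled_df = sample_df.copy()
# Assigning the returned array to a column subset keeps the column names,
# the index, and the untouched ID and SPECIES columns.
scaled_df[feature_cols] = min_max.fit_transform(sample_df[feature_cols])

print(scaled_df.head(2))
```

Because `scaled_df` is still a DataFrame, methods such as `.sample()` and `.head()` keep working on it.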

sophocles
  • Hi - thanks for the reply. But when I do that, and try and re-sample my data, I get this error: ```AttributeError: 'numpy.ndarray' object has no attribute 'sample'``` –  Jan 10 '21 at 12:25
  • Can you please show me your code so that I can see where the error comes from? – sophocles Jan 10 '21 at 12:35
  • I have added the code where the error shows. –  Jan 10 '21 at 12:39
  • I think this is because ```MinMaxScaler``` returns an array. Try changing your code to this: ```import pandas as pd```, ```sample_df = pd.DataFrame(min_max.fit_transform(sample_df[['S_LENGTH', 'S_WIDTH']]))```, ```sample_df.head(2)``` – sophocles Jan 10 '21 at 12:44
  • Thanks. I tried this, which prevented the error. But when re-sampling the data, I lose the column names and it only shows those two columns. –  Jan 10 '21 at 13:00
  • Yes, that makes sense. So let me understand what you're looking for. You would like to normalise the numerical variables in a dataframe but also keep the categorical ones in the same dataframe, right? If so, I will update my answer to help you get that. Please confirm. – sophocles Jan 10 '21 at 13:02
  • Correct. That is exactly what I want to do. Normalise the numerical columns, but retain all the columns in the normalised dataframe. –  Jan 10 '21 at 13:04