
I am fairly new to data science (I'm using Python) and have read that it's better to standardize or normalize data before going further. My questions are:

  1. What if there are categorical values (binary, one-hot encoded as 0 or 1) such as male or female? Do we need to standardize or normalize this kind of data?
  2. What if the categorical data is non-binary, for example a measurement of your health (1 = poor, 2 = quite healthy, 3 = healthy, 4 = fit, 5 = very fit)? Do we still need to standardize or normalize this kind of data?

1 Answer


If a categorical feature has more than two values, it is better to convert it with one-hot encoding. Categorical values should not imply a mathematical relationship. If you cannot justify a mathematical order between your categories (e.g. is fit > healthy?), you should create one-hot vectors and use them as features:

                 Old version      New version
                              1st  2nd  3rd  4th  5th
poor              1            0    0    0    0    1
quite healthy     2            0    0    0    1    0
healthy           3            0    0    1    0    0
fit               4            0    1    0    0    0
very fit          5            1    0    0    0    0

Basically, you have 5 new features, and each of them represents one category.

Note: There is no need to apply normalization or standardization to binary data, because it is already in [0, 1].
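As a rough illustration of the one-hot encoding described above, here is a minimal plain-Python sketch. The names `LEVELS` and `one_hot` are made up for this example, and the column order simply follows the category order (the order is arbitrary, as long as it is used consistently). In practice a library routine such as `pandas.get_dummies` would normally do this for you.

```python
# The five health levels from the question, in their original 1-5 order.
LEVELS = ["poor", "quite healthy", "healthy", "fit", "very fit"]


def one_hot(value, categories=LEVELS):
    """Return a list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Hypothetical sample column to encode.
samples = ["fit", "poor", "healthy"]
encoded = [one_hot(sample) for sample in samples]

for sample, row in zip(samples, encoded):
    print(f"{sample:>8} -> {row}")
```

Each encoded row contains only 0s and 1s, so it already lies in [0, 1] and needs no further scaling.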

Inputvector
  • I see, but what if I have other columns, for example age and salary, where age ranges from 1 to 80 and salary from, say, 5000 to 500000? Since they are in different ranges, should I apply normalization or standardization to both of them? So, for example, I would normalize or standardize age and salary but not the one-hot encoded category? – marvel sugi Mar 07 '21 at 05:28
  • You can explain a mathematical relationship between 5000 and 500000, so you can apply normalization to age and salary. Basically, if there is a mathematical relationship between the values, you can apply normalization. – Inputvector Mar 07 '21 at 05:59
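The approach discussed in these comments (scale age and salary, leave the one-hot columns alone) could be sketched roughly as follows. The rows, the column names, and the `standardize` helper are all hypothetical and only use the standard library. With scikit-learn you would more likely reach for `StandardScaler` or `MinMaxScaler` on the numeric columns.

```python
from statistics import mean, pstdev

# Hypothetical rows: two numeric columns plus an already one-hot-encoded gender.
rows = [
    {"age": 25, "salary": 5000, "male": 1, "female": 0},
    {"age": 40, "salary": 60000, "male": 0, "female": 1},
    {"age": 80, "salary": 500000, "male": 1, "female": 0},
]

# Only these columns get rescaled; the 0/1 columns are left untouched.
NUMERIC = ["age", "salary"]


def standardize(rows, columns):
    """Return new rows with `columns` rescaled to mean 0 and std 1."""
    stats = {}
    for column in columns:
        values = [row[column] for row in rows]
        stats[column] = (mean(values), pstdev(values))

    scaled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column in columns:
            mu, sigma = stats[column]
            # Guard against a constant column (sigma == 0).
            new_row[column] = (row[column] - mu) / sigma if sigma else 0.0
        scaled.append(new_row)
    return scaled


scaled_rows = standardize(rows, NUMERIC)
for row in scaled_rows:
    print(row)
```

After this step age and salary are on a comparable scale, while the `male` and `female` columns keep their original 0/1 values.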