Well, there are important differences between how OneHot Encoding and Label Encoding work:
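- Label Encoding basically converts your string variables to int. In this case, the 1st class found is coded as 1, the 2nd as 2, and so on (scikit-learn's LabelEncoder actually starts at 0 and numbers the classes in sorted order, but the idea is the same).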
But this encoding creates an issue.
Let's take the example of a variable Animal = ["Dog", "Cat", "Turtle"].
If you use Label Encoder on it, Animal will be [1, 2, 3]. If you pass it to your machine learning model, it will interpret Dog as closer to Cat than to Turtle (because the distance between 1 and 2 is smaller than the distance between 1 and 3).
Label encoding is actually excellent when you have an ordinal variable.
For example, if you have a variable Age = ["Child", "Teenager", "Young Adult", "Adult", "Old"], then using Label Encoding is perfect, provided the codes are assigned in that order. Child is closer to Teenager than it is to Young Adult. You have a natural order on your values.
- OneHot Encoding (also done by pd.get_dummies) is usually the best solution when you have no natural order between your categories.
Let's go back to the previous example of Animal = ["Dog", "Cat", "Turtle"].
It will create as many variables as there are classes. In my example, it will create 3 binary variables: Dog, Cat and Turtle. Then if you have Animal = "Dog", the encoding will make it Dog = 1, Cat = 0, Turtle = 0.
Then you can give this to your model, and it will never interpret Dog as closer to Cat than to Turtle.
But there are also cons to OneHot Encoding. If you have a categorical variable with 50 different classes (e.g. Dog, Cat, Turtle, Fish, Monkey, ...), then it will create 50 binary variables, which can cause complexity issues. In this case, you can create your own classes and manually regroup the values (e.g. group Turtle, Fish, Dolphin, Shark into a single class called Sea Animals) and then apply OneHot Encoding.
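
To make the two encodings concrete, here is a minimal sketch using only pandas. A few assumptions on my side: pd.factorize stands in for a label encoder (it numbers classes from 0 in order of first appearance, so it is not the same as scikit-learn's LabelEncoder), the ordinal case uses an explicitly ordered pd.Categorical so the codes follow the natural order, and the toy data and the Sea Animals grouping are just illustrative.

```python
import pandas as pd

# Toy data, illustrative only.
animals = pd.Series(["Dog", "Cat", "Turtle", "Dog"], name="Animal")

# Label encoding: pd.factorize assigns integer codes in order of first
# appearance, starting at 0 (assumption: used here in place of a
# dedicated label encoder).
label_codes, label_classes = pd.factorize(animals)

# Ordinal variable: state the order explicitly so the codes respect it.
age_order = ["Child", "Teenager", "Young Adult", "Adult", "Old"]
ages = pd.Series(["Adult", "Child", "Old", "Teenager"])
age_codes = pd.Categorical(ages, categories=age_order, ordered=True).codes

# One-hot encoding: one binary column per class, no implied order.
one_hot = pd.get_dummies(animals, dtype=int)

# Many classes: regroup some of them by hand before one-hot encoding.
# The grouping below is a hypothetical choice.
many = pd.Series(["Dog", "Turtle", "Fish", "Dolphin", "Shark", "Cat"])
sea_animals = {"Turtle", "Fish", "Dolphin", "Shark"}
grouped = many.where(~many.isin(sea_animals), "Sea Animals")
grouped_one_hot = pd.get_dummies(grouped, dtype=int)

print(label_codes, list(label_classes))
print(list(age_codes))
print(one_hot)
print(grouped_one_hot)
```

In this sketch the six animals end up in three one-hot columns instead of six, which is the point of regrouping before encoding.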