Given how decision trees and random forests split on feature values, I was under the impression that label encoding would not be a problem for these models. For example, if gender has the values 'male', 'female' and 'other', label encoding maps them to 0, 1, 2, which is interpreted as 0 < 1 < 2. But since the tree is going to split on the column anyway, I thought the ordering didn't matter: splitting on 'male' or on 0 should amount to the same thing. However, when I tried both label and one-hot encoding on the dataset, one-hot encoding gave better accuracy and precision. Could you kindly share your thoughts?
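To make the ordering issue concrete, here is a minimal sketch (with made-up data, not your dataset) of why the two encodings can differ. The key point: with label encoding, a tree can only split on thresholds over the arbitrary 0/1/2 ordering, so isolating a category that is not at either end of that ordering needs two splits; with one-hot encoding, one split on the indicator column suffices. A depth-limited tree (or a forest whose trees subsample features) can therefore do worse with label encoding. The category order assumed below (female=0, male=1, other=2) is just what `OrdinalEncoder` produces alphabetically.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: 'female' and 'other' belong to class 1, 'male' to class 0,
# so the two class-1 categories are NOT adjacent after alphabetical encoding.
gender = ["female"] * 300 + ["male"] * 400 + ["other"] * 300
y = np.array([1] * 300 + [0] * 400 + [1] * 300)
X = pd.DataFrame({"gender": gender})

# Label (ordinal) encoding: female=0, male=1, other=2.
X_label = OrdinalEncoder().fit_transform(X)

# One-hot encoding: one binary indicator column per category.
X_onehot = pd.get_dummies(X["gender"]).to_numpy()

# A depth-1 tree ("stump") gets exactly one split; training accuracy on the
# same data shows what a single split can express under each encoding.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
acc_label = accuracy_score(y, stump.fit(X_label, y).predict(X_label))
acc_onehot = accuracy_score(y, stump.fit(X_onehot, y).predict(X_onehot))

print(f"label encoding:   {acc_label:.2f}")   # one threshold cannot isolate 'male'
print(f"one-hot encoding: {acc_onehot:.2f}")  # splitting on the 'male' column can
```

With enough depth both encodings can reach the same fit (the label-encoded tree just spends an extra split), which is why the gap in practice is often small, as in your results below.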
The ACCURACY SCORES of the various models on train and test are:
The accuracy score of the simple decision tree on label encoded data: TRAIN: 86.46% TEST: 79.42%
The accuracy score of the tuned decision tree on label encoded data: TRAIN: 81.74% TEST: 81.33%
The accuracy score of the random forest ensemble on label encoded data: TRAIN: 82.26% TEST: 81.63%
The accuracy score of the simple decision tree on one-hot encoded data: TRAIN: 86.46% TEST: 79.74%
The accuracy score of the tuned decision tree on one-hot encoded data: TRAIN: 82.04% TEST: 81.46%
The accuracy score of the random forest ensemble on one-hot encoded data: TRAIN: 82.41% TEST: 81.66%
The PRECISION SCORES of the various models on train and test are:
The precision score of the simple decision tree on label encoded data: TRAIN: 78.26% TEST: 57.92%
The precision score of the tuned decision tree on label encoded data: TRAIN: 66.54% TEST: 64.6%
The precision score of the random forest ensemble on label encoded data: TRAIN: 70.1% TEST: 67.44%
The precision score of the simple decision tree on one-hot encoded data: TRAIN: 78.26% TEST: 58.84%
The precision score of the tuned decision tree on one-hot encoded data: TRAIN: 68.06% TEST: 65.81%
The precision score of the random forest ensemble on one-hot encoded data: TRAIN: 70.34% TEST: 67.32%