
I have a dataset with four categorical features (cholesterol, systolic blood pressure, diastolic blood pressure, and smoking rate). I use a decision tree classifier to predict the probability of stroke.

I am trying to verify my understanding of the splitting procedure used by scikit-learn.

Since the tree is binary, there are three possible ways to split on the first feature: group categories {0 and 1 in one child node, 2 in the other}, {0 and 2, 1}, or {1 and 2, 0}. What I understand (please correct me here) is that the chosen split is the one with the least information gain (computed from the Gini impurity).

I have calculated the information gain for each of the three grouping scenarios (a sketch of my computation follows the list):

{0 + 1, 2} --> 0.17
{0 + 2, 1} --> 0.18
{1 + 2, 0} --> 0.004
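
For reference, this is roughly how I compute these gains. The class counts below are made-up placeholders, not my actual data:

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node, given its class counts."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

def gain(parent, left, right):
    """Impurity decrease ("information gain" with Gini) for a binary split."""
    n, n_l, n_r = sum(parent), sum(left), sum(right)
    return gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)

# Hypothetical [stroke, no-stroke] counts per category (placeholders only):
counts = {0: [30, 10], 1: [20, 20], 2: [5, 35]}

def merged(cats):
    """Pool the class counts of several categories into one node."""
    return [sum(counts[c][k] for c in cats) for k in range(2)]

parent = merged([0, 1, 2])
for left, right in ([[0, 1], [2]], [[0, 2], [1]], [[1, 2], [0]]):
    print(left, right, round(gain(parent, merged(left), merged(right)), 4))
```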

However, sklearn's decision tree chose the first scenario instead of the third (please see the image below).

Can anyone please help clarify the reason for this selection? Is there a priority for splits that result in pure nodes, so that such a scenario is selected even though it has less information gain?

[image: the fitted sklearn decision tree, with the {0, 1} vs {2} split at the root]

GYSHIDO

1 Answer


The algorithm chooses splits by maximizing the information gain, i.e., minimizing the weighted impurity of the child nodes (Gini impurity by default, not entropy): https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
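
You can see which split sklearn actually picks by fitting a depth-1 tree and reading the chosen threshold. A minimal sketch on made-up data (not your dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# One ordinal-encoded categorical feature with levels 0, 1, 2 (toy data).
X = rng.integers(0, 3, size=(300, 1))
# Toy labels loosely tied to category 2, with some noise.
y = ((X[:, 0] == 2) ^ (rng.random(300) < 0.1)).astype(int)

clf = DecisionTreeClassifier(criterion="gini", max_depth=1, random_state=0).fit(X, y)

# A root threshold of 1.5 means the split {0, 1} vs {2};
# 0.5 would mean {0} vs {1, 2}.
print(clf.tree_.threshold[0])
print(clf.tree_.impurity)  # Gini impurity of the root and its two children
```

Note also that sklearn's CART implementation treats every feature as numeric, so with an ordinal encoding only threshold splits such as {0} vs {1, 2} and {0, 1} vs {2} are ever candidates at a single node; the grouping {0, 2} vs {1} cannot be produced in one split.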

hellpanderr