I'm using sklearn to train a binary decision tree classifier to separate spam from non-spam. My classification threshold is 50% (i.e., I flag a message as spam if I estimate a 50%+ chance that it is). Assume the classes are balanced.

Imagine one branch of my tree has 5000 non-spam samples and 100 spam samples. The tree splits this branch further: leaf A has 1000 non-spam and 70 spam, and leaf B has 4000 non-spam and 30 spam. This split doesn't get pruned because it significantly reduces the Gini impurity, but at my 50% classification threshold it doesn't change any predictions: everything is still predicted as non-spam.

It feels like logically there should be some way of automatically pruning a classification tree based on a classification threshold, but other than manually inspecting the tree I can't think of how to do this and I've been unable to turn up any solutions through Google. I could decrease the max_depth or increase the min_impurity_decrease, but both of those would penalise other branches by removing useful splits.

Apollo

1 Answer

"This split doesn't get pruned because it significantly reduces the gini, but based on my 50% classification threshold this split doesn't actually change any predictions - everything will still be predicted as non-spam."

This is incorrect. Imagine that a further split divides your 4000 non-spam/30 spam node (= 4000/30) into two branches, one with 4000/0 and one with 0/30. The latter would then predict spam at your 50% threshold. This example is excessively cherry-picked (and I cannot bother putting together synthetic data to illustrate it), but you cannot rule it out. Hence there is typically no termination criterion based on class ratio (it would not work well for very imbalanced datasets), and maximum depth or Gini gain are the more common stopping thresholds.

Learning is a mess
  • I'm aware that a further split could result in a different classification, I'm specifically asking about cases where that doesn't happen - the terminal node is 4000/30. At that point after building the tree I would like to prune that split. I'm not asking about termination criteria (since the split could become useful later), I'm asking about pruning once the full tree has been built and we know whether the split becomes useful. – Apollo Jul 03 '23 at 12:05
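
The post-pruning described in the comment above can be sketched roughly as follows. This is a minimal illustration, not a built-in sklearn feature: the helper name `collapsible_nodes` and its parameters are hypothetical, and it works on plain arrays laid out the way sklearn lays out a fitted tree (child index arrays with -1 marking "no child", plus per-node class counts or fractions). It walks the tree bottom-up and reports every internal node whose subtree gives the same decision everywhere at the chosen threshold. The example reuses the numbers from the question, with an assumed root and an assumed spam-heavy sibling branch added so that the tree has something worth keeping.

```python
TREE_LEAF = -1  # sentinel value scikit-learn uses for "no child"


def collapsible_nodes(children_left, children_right, values,
                      threshold=0.5, positive=1):
    """Return ids of internal nodes whose whole subtree makes one decision.

    values[node] holds per-class counts (or fractions) for that node;
    `positive` is the index of the class compared against `threshold`.
    """
    decision = {}      # node id -> True if the node predicts the positive class
    collapsed = set()  # internal nodes that can be turned into leaves

    def visit(node):
        """Return True if `node` is a leaf or can be collapsed into one."""
        counts = values[node]
        decision[node] = counts[positive] / sum(counts) >= threshold
        left, right = children_left[node], children_right[node]
        if left == TREE_LEAF:
            return True
        # Visit both subtrees first so that pruning propagates upwards.
        left_is_leaf = visit(left)
        right_is_leaf = visit(right)
        if left_is_leaf and right_is_leaf and decision[left] == decision[right]:
            collapsed.add(node)
            return True
        return False

    visit(0)
    return collapsed


# Node 0: assumed root, node 1: the 5000/100 branch from the question,
# node 2: leaf A, node 3: leaf B, node 4: assumed spam-heavy leaf.
children_left = [1, 2, -1, -1, -1]
children_right = [4, 3, -1, -1, -1]
values = [[5100, 5100], [5000, 100], [1000, 70], [4000, 30], [100, 5000]]

prunable = collapsible_nodes(children_left, children_right, values)
print(prunable)
```

For a fitted `DecisionTreeClassifier` named `clf`, the corresponding inputs would be `clf.tree_.children_left`, `clf.tree_.children_right` and `clf.tree_.value[:, 0, :]`. A commonly used workaround then overwrites the child entries of the reported nodes with -1 in place; that relies on sklearn internals rather than a public API, so treat it as an assumption to verify against your installed version. Note that collapsing these nodes leaves the hard predictions at the threshold unchanged but does change the output of `predict_proba`.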