I'm using sklearn to train a binary classification decision tree to classify spam vs. non-spam. My classification threshold is 50% (i.e. I flag an email as spam if I think there's a 50%+ chance that it is). Assume the classes aren't imbalanced.
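For a binary tree, a 50% threshold is just the majority-class rule, so thresholding predict_proba agrees with plain predict. Quick check on synthetic data (the dataset and parameters here are made up for illustration, not my real spam set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)

# My 50% rule: flag as spam when the predicted probability of class 1
# exceeds 0.5. For two classes this is exactly the majority class,
# so it matches clf.predict.
flag_spam = clf.predict_proba(X)[:, 1] > 0.5
assert (flag_spam == clf.predict(X).astype(bool)).all()
```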
Imagine one branch of my tree has 5000 non-spam samples and 100 spam. The tree splits this down further: leaf A gets 1000 non-spam and 70 spam, leaf B gets 4000 non-spam and 30 spam. This split doesn't get pruned because it reduces the Gini impurity, but at my 50% threshold it doesn't change a single prediction: both leaves are still majority non-spam, so everything in this branch is classified as non-spam either way.
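To make that concrete, here's the arithmetic (plain Python, nothing sklearn-specific): the split lowers the weighted Gini impurity, yet the majority class, and therefore the 50%-threshold prediction, is non-spam at the parent and at both leaves:

```python
# Check the example above: the split reduces Gini impurity but never
# flips a prediction at a 50% threshold.

def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = [5000, 100]   # [non-spam, spam]
leaf_a = [1000, 70]
leaf_b = [4000, 30]

n = sum(parent)
weighted_children = sum(sum(leaf) / n * gini(leaf) for leaf in (leaf_a, leaf_b))

print(f"parent Gini:            {gini(parent):.4f}")       # ~0.0384
print(f"weighted children Gini: {weighted_children:.4f}")  # ~0.0373, lower, so the split stays

# The 50%-threshold prediction is just the majority class, which is
# non-spam everywhere, so the split changes nothing.
for name, counts in [("parent", parent), ("leaf A", leaf_a), ("leaf B", leaf_b)]:
    print(name, "-> spam" if counts[1] > counts[0] else "-> non-spam")
```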
It feels like there should logically be some way of automatically pruning a classification tree based on a classification threshold, but other than manually inspecting the tree (or hacking at its internals, as sketched below) I can't think of how to do this, and I've been unable to turn up any solutions through Google. I could decrease max_depth or increase min_impurity_decrease, but both of those are global knobs that would penalise other branches by removing genuinely useful splits.
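For reference, the sketch I mentioned: after fitting, walk the tree and collapse any internal node whose leaves all share the same majority class, by overwriting the children arrays on clf.tree_. This pokes at sklearn internals that aren't a public API (the -1 "no child" sentinel and the in-place writability of those arrays are assumptions based on the current implementation), so treat it as illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

TREE_LEAF = -1  # sklearn's internal sentinel for "no child" (private detail)

def majority(tree, node):
    # tree.value[node] holds per-class counts (fractions in newer
    # sklearn versions); argmax gives the majority class either way.
    return int(np.argmax(tree.value[node][0]))

def prune_same_prediction(tree, node=0):
    """Collapse any subtree whose leaves all predict the same class.

    Returns the set of majority classes among the leaves under `node`;
    when that set has a single element, the split below `node` can't
    change a prediction, so `node` is turned into a leaf.
    """
    if tree.children_left[node] == TREE_LEAF:  # nodes are leaves or have two children
        return {majority(tree, node)}
    classes = (prune_same_prediction(tree, tree.children_left[node])
               | prune_same_prediction(tree, tree.children_right[node]))
    if len(classes) == 1:
        tree.children_left[node] = TREE_LEAF   # mutates private arrays in place
        tree.children_right[node] = TREE_LEAF
    return classes

def reachable_nodes(tree, node=0):
    """Count nodes still reachable from the root (pruned ones are orphaned)."""
    if tree.children_left[node] == TREE_LEAF:
        return 1
    return (1 + reachable_nodes(tree, tree.children_left[node])
              + reachable_nodes(tree, tree.children_right[node]))

# Toy demo on made-up data (not my real spam set).
X, y = make_classification(n_samples=5000, random_state=0)
clf = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X, y)

before = clf.predict(X)
size_before = reachable_nodes(clf.tree_)
prune_same_prediction(clf.tree_)
print(f"reachable nodes: {size_before} -> {reachable_nodes(clf.tree_)}")
assert (clf.predict(X) == before).all()  # pruning never changed a prediction
```

This seems to work on toy data, but since it leans on private internals I'd much rather find something built in, or a more principled approach.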