
I am working with a decision tree algorithm on a binary classification problem, and the goal is to minimise false positives (i.e. maximise the positive predictive value), because the cost of the diagnostic tool is very high.

Is there a way to introduce weights into the Gini / entropy splitting criterion to penalise false-positive misclassifications?

Here, for example, a modified (loss-weighted) Gini index is given as:

$$G_m = \sum_{k \neq k'} L_{kk'}\,\hat{p}_{mk}\,\hat{p}_{mk'},$$

where $L_{kk'}$ is the loss of classifying a class-$k$ observation as class $k'$ and $\hat{p}_{mk}$ is the proportion of class-$k$ observations in node $m$.
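To make the formula concrete, here is a rough NumPy sketch of how such a loss-weighted node impurity could be computed; the function name, the toy cost matrix L and the labels are only illustrative, and this is not existing scikit-learn functionality:

import numpy as np

def weighted_gini(y, L):
    """Loss-weighted Gini impurity of a single node.

    y -- 1-D array of class labels of the samples in the node
    L -- square cost matrix, L[k, j] = cost of predicting class j
         when the true class is k (diagonal entries are zero)
    """
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()  # class proportions p_mk in the node
    # sum over k != j of L[k, j] * p_k * p_j
    return sum(L[classes[i], classes[j]] * p[i] * p[j]
               for i in range(len(classes))
               for j in range(len(classes)) if i != j)

# toy example: a false positive (true 0 predicted as 1) costs twice as much
L = np.array([[0.0, 2.0],
              [1.0, 0.0]])
print(weighted_gini(np.array([0, 0, 0, 1, 1]), L))  # 0.72 vs plain Gini 0.48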

Therefore I am wondering: is there any way to implement this in scikit-learn?

EDIT

Playing with `class_weight` produced the following results:

from sklearn import datasets as dts
from sklearn import tree

iris_data = dts.load_iris()
X, y = iris_data.data, iris_data.target

# keep only classes 1 and 2, the less separable pair
X = X[y > 0]
y = y[y > 0]
y = y - 1  # relabel to binary {0, 1}

# decision tree with at most two levels and no class weighting
dt = tree.DecisionTreeClassifier(max_depth=2, class_weight=None)

# fit on the first two features; no train/test split for simplicity
dt.fit(X[:55, :2], y[:55])

Plotting the decision boundary and the tree (blue points are the positive class, 1):

[figure: decision boundary and fitted tree, class_weight=None]

While up-weighting the minority (more precious) class:

dt_100 = tree.DecisionTreeClassifier(max_depth=2, class_weight={1: 100})
dt_100.fit(X[:55, :2], y[:55])  # same data as before

[figure: decision boundary and fitted tree, class_weight={1: 100}]
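The original figures are not reproduced here; roughly, the plots were generated along these lines. This is only a sketch: the plot_boundary helper is hypothetical, and it assumes matplotlib and a scikit-learn version that provides tree.plot_tree.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree

def plot_boundary(clf, X, y, title):
    # evaluate the classifier on a grid spanning the two features
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 200),
                         np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 200))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title(title)
    plt.show()
    tree.plot_tree(clf)  # and the fitted tree itself
    plt.show()

plot_boundary(dt, X[:55, :2], y[:55], "class_weight=None")
plot_boundary(dt_100, X[:55, :2], y[:55], "class_weight={1: 100}")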

Arnold Klein

1 Answer


Decision tree classifiers in scikit-learn support the `class_weight` argument.

In two-class problems, this can solve your issue exactly: giving one class a higher weight makes misclassifying that class more costly in the (weighted) impurity computation. Typically this is used for imbalanced problems. For more than two classes, it is not possible to specify a full misclassification-cost matrix this way (as far as I know), since `class_weight` assigns one weight per class rather than one per pair of labels.
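As a minimal sketch of the idea (the synthetic data from make_classification and the weight of 5 for class 0 are just illustrative choices, not something from the question): up-weighting the negative class makes it costlier to send true negatives to a positive leaf, which should push the false-positive count down and the PPV up.

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# illustrative synthetic data, not from the question
X, y = make_classification(n_samples=500, n_features=4, weights=[0.7, 0.3],
                           random_state=0)

for cw in (None, {0: 5, 1: 1}):  # up-weight class 0 -> false positives cost more
    clf = DecisionTreeClassifier(max_depth=2, class_weight=cw, random_state=0)
    clf.fit(X, y)
    tn, fp, fn, tp = confusion_matrix(y, clf.predict(X)).ravel()
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    print(cw, "false positives:", fp, "PPV:", round(ppv, 3))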

Quickbeam2k1
  • Thanks, I tried, but when I up-weight one class, the decision tree and the selected features / thresholds change drastically as a function of the class weights. I will update the question with an example. Class weight simply multiplies the number of data points. – Arnold Klein Apr 24 '18 at 13:31
  • A (drastic) change of the tree has to be expected! Why does class weight multiply the number of data points? Where did you find that? – Quickbeam2k1 Apr 24 '18 at 13:34
  • I updated the question. Based on the tree diagram, you can easily see the difference in the number of points. The `class_weight` option simply creates multiple copies of the specified class. Controlled bootstrap. – Arnold Klein Apr 24 '18 at 13:55
  • Not sure about the copies, but I think the idea is related. Anyway, you are penalizing the classification results of one class. You see in the second image that misclassifying blue is more expensive than red. So what would you expect? Is the problem that the Gini value is always the same in the top node? – Quickbeam2k1 Apr 24 '18 at 14:17
  • Ideally I would like to control how the splitting criterion works (as I showed with the formula above). Well, I think I'm getting the idea. Will experiment more. Thanks! – Arnold Klein Apr 24 '18 at 14:29
  • But you are doing that: in your formula, you increase the cost with L12 = 2, say. If you now think of resampling and resample that class twice as often, you just add twice as many of the same terms, which then leads to the same result (see the sketch below). – Quickbeam2k1 Apr 24 '18 at 14:45
  • Right! Will check this out. Many thanks for elaborating. – Arnold Klein Apr 24 '18 at 15:01
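To illustrate the resampling argument from the comments, here is a quick sanity check one could run (the full two-class iris subset and the factor 2 are arbitrary choices): weighting class 1 by 2 via class_weight should produce the same splits as literally duplicating its samples.

import numpy as np
from sklearn import datasets as dts, tree

# same two-class iris subset as in the question (full subset for simplicity)
iris = dts.load_iris()
mask = iris.target > 0
X, y = iris.data[mask, :2], iris.target[mask] - 1

# class 1 weighted twice via class_weight
dt_w = tree.DecisionTreeClassifier(max_depth=2, class_weight={1: 2},
                                   random_state=0).fit(X, y)

# class 1 literally duplicated in the training data
X_dup = np.vstack([X, X[y == 1]])
y_dup = np.concatenate([y, y[y == 1]])
dt_dup = tree.DecisionTreeClassifier(max_depth=2,
                                     random_state=0).fit(X_dup, y_dup)

# the selected features and thresholds should coincide
print(dt_w.tree_.feature, dt_w.tree_.threshold)
print(dt_dup.tree_.feature, dt_dup.tree_.threshold)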