
I have a basic decision tree classifier with Scikit-Learn:

#Used to distinguish men from women based on height and shoe size

from sklearn import tree

#height and shoe size
X = [[65,9],[67,7],[70,11],[62,6],[60,7],[72,13],[66,10],[67,7.5]]

Y = ["male","female","male","female","female","male","male","female"]

#creating a decision tree
clf = tree.DecisionTreeClassifier()

#fitting the data to the tree
clf.fit(X, Y)

#predicting the gender for a new sample (note the double brackets: predict expects a list of samples)
prediction = clf.predict([[68,9]])

#print the predicted gender
print(prediction)

When I run the program, it always outputs either "male" or "female", but how would I be able to see the probability of the prediction being male or female? For example, the prediction above returns "male", but how would I get it to print the probability of the prediction being male?

Thanks!

Davis Keene
  • As answers have noted, you can use `predict_proba`, but beware the probabilities aren't very good: https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html – Max Ghenis Dec 03 '18 at 02:41
  • "aren't very good" is an understatement. Because you're using a decision tree, every sample is in the "male" branch or the "female" branch. So the probability will always be 1. – Teepeemm Dec 05 '19 at 14:48

3 Answers


You can do something like the following:

from sklearn import tree

#load data
X = [[65,9],[67,7],[70,11],[62,6],[60,7],[72,13],[66,10],[67,7.5]]
Y = ["male","female","male","female","female","male","male","female"]

#build model
clf = tree.DecisionTreeClassifier()

#fit
clf.fit(X, Y)

#predict
prediction = clf.predict([[68,9],[66,9]])

#probabilities
probs = clf.predict_proba([[68,9],[66,9]])

#print the predicted gender
print(prediction)
print(probs)

Theory

clf.predict_proba(X) returns the predicted class probabilities: for each sample, the fraction of training samples of each class in the leaf that the sample falls into.

Interpretation of the results:

The first print returns ['male' 'male'], so the samples [[68,9],[66,9]] are both predicted as male.

The second print returns:

[[ 0.  1.]
 [ 0.  1.]]

This means that both samples were predicted as male with probability 1, as reported by the ones in the second column.

To see the order of the classes use: clf.classes_

This returns: ['female', 'male']
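To make the mapping from columns to classes explicit, you can zip clf.classes_ with a row of predict_proba output (a minimal sketch using the question's data):

```python
from sklearn import tree

# data from the question: [height, shoe size] -> gender
X = [[65, 9], [67, 7], [70, 11], [62, 6], [60, 7], [72, 13], [66, 10], [67, 7.5]]
Y = ["male", "female", "male", "female", "female", "male", "male", "female"]

clf = tree.DecisionTreeClassifier()
clf.fit(X, Y)

# classes_ gives the column order of predict_proba,
# so each label pairs with its probability
for label, p in zip(clf.classes_, clf.predict_proba([[68, 9]])[0]):
    print(label, p)
```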

seralouk
  • Thank you for this! I decided to use a SVM instead of a decision tree for this problem, since it has a probability parameter. But this is a great answer! – Davis Keene Nov 13 '17 at 13:50
  • I don't think you've provided a probability in the sense that OP was looking for. This just returns a similarly binary answer. I'm assuming OP is looking to provide some confidence, as in a logistic regression, so that most values will be between 0 and 1 but not equal to 0 or 1. I'm not even sure that's possible. – DangerousDave Sep 23 '18 at 17:11
  • I believe that this is exactly what the OP asked for. He has also accepted my answer. – seralouk Sep 24 '18 at 11:32
  • It just happens that the tree predicts 0% and 100% probabilities in this case. Other data will produce different probabilities. If anyone happens to know whether something similar can be done for `DecisionTreeRegressor`s, I asked at https://stackoverflow.com/questions/53586860/equivalent-of-predict-proba-for-decisiontreeregressor. – Max Ghenis Dec 03 '18 at 03:12

Sounds like you need to read the sklearn documentation for DecisionTreeClassifier and see:

predict_proba(X[, check_input])
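For example (a minimal sketch using the question's data): predict_proba, like predict, expects a 2-D array, i.e. a list of samples, so a single sample has to be wrapped in an outer list:

```python
from sklearn import tree

# data from the question: [height, shoe size] -> gender
X = [[65, 9], [67, 7], [70, 11], [62, 6], [60, 7], [72, 13], [66, 10], [67, 7.5]]
Y = ["male", "female", "male", "female", "female", "male", "male", "female"]

clf = tree.DecisionTreeClassifier().fit(X, Y)

# note the double brackets: one sample inside a list of samples
print(clf.predict_proba([[68, 9]]))
```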
Coloane
  • I looked at the documentation a little. I tried to call print(clf.predict_proba(X)) and I got this result: [[ 0. 1.] [ 1. 0.] [ 0. 1.] [ 1. 0.] [ 1. 0.] [ 0. 1.] [ 0. 1.] [ 1. 0.]] What does this mean? – Davis Keene Nov 12 '17 at 17:40
  • You provided the data X, Y and you've asked the algorithm to predict X. That's why the probabilities are showing up as [0. 1.] – Coloane Nov 12 '17 at 17:48
  • Just to clarify further, enter predict_proba(`what you are trying to predict`), not X. Does this make sense? – Coloane Nov 12 '17 at 17:54
  • Oh, okay. So I would do predict_proba([68,9])? – Davis Keene Nov 12 '17 at 22:10

The answer above is correct: you are getting binary output because your tree is fully grown, not truncated. To make the tree weaker, set max_depth to a lower value; then the leaves stay impure and the probabilities look like [0.25 0.75] instead of [0. 1.]. Another problem here is that this dataset is very small and easy to separate, so it is better to experiment with a more complex dataset. Some links that might make this clearer for you, mate:

https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict_proba
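As a minimal sketch of that idea (using scikit-learn's bundled iris dataset rather than the question's data, since the question's eight samples are perfectly separable and would still give 0/1 probabilities even in a shallow tree):

```python
from sklearn import tree
from sklearn.datasets import load_iris

# iris: three overlapping classes, so a shallow tree
# cannot separate them perfectly
X, y = load_iris(return_X_y=True)

# truncating the tree leaves some leaves impure, so
# predict_proba returns class fractions instead of 0/1
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# probabilities for one sample; each row sums to 1
print(clf.predict_proba(X[50:51]))
```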