
Suppose I have trained a classification model whose target variable is 0, 1, or 2. If I use predict, the answer is one of 0, 1, or 2. But if I use predict_proba, I get a row with 3 columns for each input row, for example:

   model = ... Classifier       # It could be any classifier
   m1 = model.predict(mytest)
   m2 = model.predict_proba(mytest)

   # Now suppose m2[3] = [0.6, 0.2, 0.2]

Suppose I use both predict and predict_proba. If predict_proba gives the above result at index 3, should predict then give 0 at index 3? I am trying to understand how the outputs of predict and predict_proba relate to each other for the same model.

User 19826
    Please, instead of "*suppose*", post an actual code example of using both `predict` and `predict_proba`, so we can ground the discussion in an actual (and not hypothetical) case. – desertnaut Apr 13 '20 at 09:50
  • Thanks, I will edit my question – User 19826 Apr 13 '20 at 15:26
  • Still unclear. `m1` is supposed to contain single numbers (classes), while here you show it as if containing probabilities. Please, take your time, focus, and update/clarify the question accordingly (the idea was to get rid of "*suppose*", by showing an actual example of **both** `predict` and `predict_proba` on the **same** test sample and focus the question on this, but you haven't done so). – desertnaut Apr 13 '20 at 15:37
  • Possible duplicate: https://stackoverflow.com/questions/56397128/roc-auc-score-is-different-while-calculating-using-predict-vs-predict-proba – M.Mavini Oct 16 '21 at 09:39

1 Answer

  • predict() is used to predict the actual class (in your case, one of 0, 1, or 2).
  • predict_proba() is used to predict the class probabilities.

From the example output that you shared,

  • predict() would output class 0, since class 0 has the highest probability (0.6).
  • [0.6, 0.2, 0.2] is the output of predict_proba, which simply denotes that the class probabilities for classes 0, 1, and 2 are 0.6, 0.2, and 0.2 respectively.
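This index-by-index relationship can be checked directly: for each row, predict() returns the class whose column in predict_proba() has the highest value. A minimal sketch with a scikit-learn classifier (the LogisticRegression model and the synthetic data are just illustrative assumptions, not taken from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, standing in for the question's "mytest"
X, y = make_classification(
    n_samples=200, n_classes=3, n_informative=4, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

m1 = model.predict(X[:5])        # class labels, e.g. array([0, 2, 1, ...])
m2 = model.predict_proba(X[:5])  # one row of 3 probabilities per sample

# At every index i, predict() agrees with the argmax of predict_proba()
for i in range(5):
    assert m1[i] == np.argmax(m2[i])
```

So yes: if m2[3] is [0.6, 0.2, 0.2], then m1[3] is 0.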

Now as the documentation mentions for predict_proba, the resulting array is ordered based on the labels you've been using:

The returned estimates for all classes are ordered by the label of classes.

Therefore, since your class labels are [0, 1, 2], the columns of predict_proba line up with those labels: 0.6 is the probability that the instance is classified as 0, and the two 0.2 values are the probabilities that it is classified as 1 and 2, respectively.
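When the labels are not simply 0, 1, 2, the mapping goes through the fitted estimator's classes_ attribute, which lists the labels in the same (sorted) order as the predict_proba columns. A small sketch, with a DecisionTreeClassifier and toy string labels assumed purely for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data with string labels instead of 0/1/2
X = [[0], [1], [2], [0], [1], [2]]
y = ["cat", "dog", "bird", "cat", "dog", "bird"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

print(model.classes_)  # ['bird' 'cat' 'dog'] -- sorted label order

proba = model.predict_proba([[0]])[0]
# predict() is equivalent to looking up the argmax column in classes_
assert model.predict([[0]])[0] == model.classes_[np.argmax(proba)]
```

With integer labels [0, 1, 2], classes_ is [0, 1, 2], so the argmax column index and the predicted label happen to coincide.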


For a more comprehensive explanation, refer to the article What is the difference between predict() and predict_proba() in scikit-learn on TDS.

Giorgos Myrianthous
  • @Giorgos, please note my question is regarding the relationship between exact indexes of these two. Also, I wonder if there is a typo in your answer, there are two ones as output of predict() – User 19826 Apr 13 '20 at 15:27