
For a multiclass problem, I can use sklearn's RandomForestClassifier out of the box:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris(as_frame=True)

y = data['target']
X = data['data']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

RF = RandomForestClassifier(random_state=42)
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test)  # shape is (30, 3)

Or I can replace the above

RF = RandomForestClassifier(random_state=42)
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test)  # shape is (30, 3)

with this to train 3 binary-classification models:

from sklearn.multiclass import OneVsRestClassifier
RF = OneVsRestClassifier(RandomForestClassifier(random_state=42))
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test)  # shape is (30, 3)

I can then binarize the test labels (e.g. with label_binarize) and use y_score to plot one ROC curve per class, as per the official docs.
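For concreteness, here is a minimal sketch of that ROC step as I understand it from the docs. It assumes the native multiclass forest from the first snippet; the names y_test_bin and roc_auc are only illustrative, and plotting is left out so that only the per-class curves and their areas are computed.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

data = load_iris(as_frame=True)
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

RF = RandomForestClassifier(random_state=42)
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test)  # shape is (30, 3)

# Binarize the true labels: one 0/1 indicator column per class, shape (30, 3)
y_test_bin = label_binarize(y_test, classes=RF.classes_)

# One ROC curve per class, treating that class as positive and the rest as negative
roc_auc = {}
for i, cls in enumerate(RF.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[cls] = auc(fpr, tpr)
    print(f"class {cls}: AUC = {roc_auc[cls]:.3f}")
```

The same loop should work unchanged if RF is the OneVsRestClassifier wrapper, since its predict_proba also returns one column per class.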

I am unsure which approach to take: the standard RandomForest multiclass approach, or the OneVsRest approach? For some models, such as support vector classifiers, multiclass is only possible through a one-vs-one (OvO) or one-vs-rest (OvR) decomposition. RandomForest is different, however, since the multiclass approach is native to the algorithm.
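To make the comparison concrete, here is a rough sketch that fits both variants on the same split and scores each with the same macro-averaged one-vs-rest ROC AUC. The dictionary names (models, scores) are just illustrative, and this only shows that both outputs can be evaluated identically; it is not meant to settle which approach is preferable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

data = load_iris(as_frame=True)
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Indicator matrix of the true test labels, one column per class
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

models = {
    "native multiclass": RandomForestClassifier(random_state=42),
    "OneVsRest wrapper": OneVsRestClassifier(RandomForestClassifier(random_state=42)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_score = model.predict_proba(X_test)  # shape is (30, 3) in both cases
    # Each column is scored as "this class vs. the rest", then averaged over classes
    scores[name] = roc_auc_score(y_test_bin, y_score, average="macro")
    print(f"{name}: macro OvR ROC AUC = {scores[name]:.3f}")
```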

What makes me lean towards OvR in this case is that the official docs state:

ROC curves are typically used in binary classification to study the output of a classifier

and OvR is binary classification...

desertnaut
Oliver Angelil
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Aug 03 '22 at 10:08

0 Answers