For a multiclass problem, I can use sklearn's RandomForestClassifier out of the box:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
data = load_iris(as_frame=True)
y = data['target']
X = data['data']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
RF = RandomForestClassifier(random_state=42)
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test) # shape is (30, 3)
Or I can replace the above
RF = RandomForestClassifier(random_state=42)
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test) # shape is (30, 3)
with this to train 3 binary-classification models:
from sklearn.multiclass import OneVsRestClassifier
RF = OneVsRestClassifier(RandomForestClassifier(random_state=42))
RF.fit(X_train, y_train)
y_score = RF.predict_proba(X_test) # shape is (30, 3)
I can then binarize the true labels (y_test) and use y_score to plot one ROC curve per class, as per the official docs.
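To make the binarize-then-plot step concrete, here is a minimal sketch of how per-class ROC curves could be computed from `y_score`. It assumes the same iris split as above; the names `clf`, `y_test_bin`, and `roc_auc` are mine, and the actual plotting call is left as a comment.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_score = clf.predict_proba(X_test)  # one probability column per class

# One-hot encode the true labels so that column i means "class i vs. the rest".
y_test_bin = label_binarize(y_test, classes=clf.classes_)

# Compute one ROC curve (and its AUC) per class.
roc_auc = {}
for i, label in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[int(label)] = auc(fpr, tpr)
    # fpr and tpr could be drawn here, e.g. with matplotlib's plt.plot(fpr, tpr)
```

The same loop should work unchanged if `clf` is the `OneVsRestClassifier` wrapper, since it only relies on `classes_` and `predict_proba`.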
I am unsure which approach to take: the standard RandomForest multiclass approach or the OneVsRest approach. For some models, such as support vector classifiers, one must use OvO or OvR for multiclass problems. RandomForest is different, however, since multiclass classification is native to the algorithm.
What makes me lean towards OvR in this case is that the official docs state:
ROC curves are typically used in binary classification to study the output of a classifier
and OvR is binary classification...
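One way to probe the difference, sketched here under the assumption that the iris split above is reused, is to apply the same one-vs-rest ROC AUC to the probabilities of both models. As far as I can tell, `roc_auc_score` with `multi_class="ovr"` only needs a probability column per class, so it can score either model; the names `native`, `ovr`, and `scores` are mine.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Native multiclass forest vs. three binary forests wrapped in OvR.
native = RandomForestClassifier(random_state=42).fit(X_train, y_train)
ovr = OneVsRestClassifier(RandomForestClassifier(random_state=42)).fit(X_train, y_train)

# Score both with the same one-vs-rest, macro-averaged ROC AUC.
scores = {}
for name, model in {"native": native, "ovr": ovr}.items():
    proba = model.predict_proba(X_test)  # shape (30, 3), rows sum to 1
    scores[name] = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
```

Comparing `scores["native"]` and `scores["ovr"]` on a given dataset would show whether the extra wrapper changes the curves in practice; this sketch does not settle which approach is preferable in general.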