4

I am trying to understand a code block from a guided tutorial for the classic Iris Classification problem.

The code block for the final model is as follows:

from sklearn.svm import SVC

chosen_model = SVC(gamma='auto')
chosen_model.fit(X_train,Y_train)
predictions = chosen_model.predict(X_valid)

The image in the original post showed the data types in X_train and Y_train: both are NumPy arrays, and Y_train contains the Iris species as strings.

My question is simple: how come the model works even though I haven't One-Hot Encoded Y_train into different binary columns? My understanding from other tutorials is that for multi-class classification I need to first do one-hot encoding.

The code works fine; I just want to understand when I need to one-hot encode and when I don't. Thank you!

yatu

2 Answers

3

I think you might be confusing multiclass classification (your case) with multioutput classification.

In a multiclass classification problem, the output is a single target column, and the model is trained to choose among the classes in that column. You would only split the target into separate columns if you had to predict several different labels per sample, which is not the case here: you want exactly one class per sample.
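The distinction can be checked with scikit-learn's `type_of_target` helper. The sketch below is only an illustration: the label arrays are made up, and the two-column array stands in for a hypothetical multioutput target.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One column of string labels, one class per sample: multiclass.
y_multiclass = np.array(["setosa", "versicolor", "virginica"] * 4)

# Two target columns per sample (made-up data): multioutput.
y_multioutput = np.tile(np.array([[1, 2], [3, 1]]), (4, 1))

kind_single = type_of_target(y_multiclass)
kind_multi = type_of_target(y_multioutput)

print(kind_single)
print(kind_multi)
```

The Iris target falls in the first category, which is why a single column of species names is enough.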

So for multiclass classification there's no need to one-hot encode the target, since you only want a single target column (which can also be categorical in SVC). What you do have to encode, either with OneHotEncoder or with some other encoder, are any categorical input features, because the inputs have to be numeric.
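As a minimal sketch of encoding an input feature, assuming a hypothetical `colors` column that is not part of the Iris data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical input feature: a single column of colors.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so densify it for display.
colors_encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # categories are sorted: blue, green, red
print(colors_encoded)       # one binary column per category
```

Each row ends up with exactly one 1, so the model sees no artificial ordering between colors.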

Also, SVC can deal with categorical targets, since it label-encodes them internally:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
y_train_categorical = load_iris()['target_names'][y_train]
# array(['setosa', 'setosa', 'versicolor',...

sv = SVC()
sv.fit(X_train, y_train_categorical)
sv.classes_
# array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
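A short follow-up sketch, under the same setup, showing that predictions also come back as the original string labels. The variable names and `random_state=0` are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
y_names = iris.target_names[iris.target]  # species as strings

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y_names, random_state=0
)

model = SVC().fit(X_train, y_train)
predictions = model.predict(X_test)  # string labels, no decoding step needed

print(predictions[:5])
print(model.score(X_test, y_test))
```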
yatu
  • This is perfect @yatu. Just so I understand - `SVC` can handle categorical outputs, but there are other classifiers that DO need the output to be OneHotEncoded? I'm thinking a basic FNN. – Arvind Raghavan Jul 15 '20 at 20:00
  • No @ArvindRaghavan, for multiclass classification with other models you'll generally have to use a [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html), which encodes the classes into integers, not a OneHotEncoder – yatu Jul 16 '20 at 07:02
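For illustration, a minimal sketch of what LabelEncoder does to a target; the `species` list is made up for the example:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values for the example.
species = ["setosa", "virginica", "versicolor", "setosa"]

label_encoder = LabelEncoder()
species_int = label_encoder.fit_transform(species)

print(label_encoder.classes_)  # sorted class names
print(species_int)             # [0 2 1 0]

# The mapping is reversible, so integer predictions can be turned back into names.
decoded = label_encoder.inverse_transform(species_int)
print(decoded)
```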
-1

As far as I know, one-hot encoding is never done on the output. You one-hot encode a feature so that the model does not treat one category as greater than another, for example one color as greater than other colors. When computing the output, models use probability distributions over the classes, so there won't be any problem here.

In a nutshell, you should one-hot encode only the input features and not the output classes.

S.Hemanth
  • For a classification problem, target variables need to be encoded into numerical form, either one-hot encoded or mapped to integers. SVC can work with categorical values because it encodes them internally. But in general you need to encode the targets. – mujjiga Jul 15 '20 at 09:46
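For cases where a one-hot target is wanted, scikit-learn's LabelBinarizer is one way to produce it. This is only a sketch with made-up labels, not something the thread's SVC example requires:

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up target values for the example.
species = ["setosa", "virginica", "versicolor", "setosa"]

binarizer = LabelBinarizer()
species_onehot = binarizer.fit_transform(species)

print(binarizer.classes_)  # column order: setosa, versicolor, virginica
print(species_onehot)      # one binary column per class
```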