1

I know that, as features, ordinal data can be mapped to integers and categorical data can be one-hot encoded. But I am a bit confused about how these two types of data should be handled when they are the target to be predicted. For instance, in the iris dataset in scikit-learn:

from sklearn import datasets

iris = datasets.load_iris()
X = iris.data    # feature matrix
y = iris.target  # class labels stored as the integers 0, 1, 2

While y represents three types of flowers, which is categorical data (if I'm not wrong), it is encoded as the ordinal-looking values 0, 1, 2 (dtype int32). My dataset also has 3 independent categories ('sick', 'carrier', 'healthy'), and scikit-learn accepts them as strings without any encoding.

I was wondering whether it is correct to pass them to scikit-learn as they are, or whether an encoding like the one used for the iris dataset is required.

Masih

2 Answers

0

You don't need to encode your labels; scikit-learn takes care of it. Here is the iris data used to build a classifier with its original integer labels:

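from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # y holds the integer labels 0, 1, 2
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])        # predicted class labels for the first two rows
clf.predict_proba(X[:2, :])  # one probability column per class
clf.score(X, y)              # mean accuracy on the training data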

Next I take a smaller subset and change the labels from integers to strings:

X1 = X[:5]                       # first five rows of the features
y1 = ['a', 'a', 'a', 'b', 'a']   # string labels instead of integers
clf = LogisticRegression(random_state=0).fit(X1, y1)
clf.predict(X1[:2, :])           # predictions come back as strings
clf.predict_proba(X1[:2, :])
clf.score(X1, y1)

and it all works the same way.
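As a rough check of the point above, the sketch below fits the same classifier once on the integer iris labels and once on string labels, then compares the results. The string names and `max_iter=1000` are choices made for this illustration, not part of the original example. The fitted `classes_` attribute shows the labels the classifier has recorded, which suggests it treats them as unordered class identifiers rather than as numbers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# String names for the three classes (chosen for this illustration).
names = np.array(["setosa", "versicolor", "virginica"])
y_str = names[y]  # same labels as y, but as strings

# max_iter is raised only to avoid convergence warnings (an assumption of this sketch).
clf_int = LogisticRegression(max_iter=1000).fit(X, y)
clf_str = LogisticRegression(max_iter=1000).fit(X, y_str)

# The classifier stores the distinct labels it saw, in sorted order.
print(clf_int.classes_)
print(clf_str.classes_)

# Map the integer predictions to names and compare with the string predictions.
same = np.array_equal(names[clf_int.predict(X)], clf_str.predict(X))
print(same)
```

If both fits give the same predictions, the integer coding carried no extra ordinal meaning for this classifier.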

MTT
  • Thanks for the answer, but I still don't understand how encoding the labels as ordinal integers can act the same as treating them as categorical, since they obviously have different interpretations. In the case of iris, does scikit-learn interpret the labels as categorical or ordinal? – Masih Jan 30 '20 at 18:15
-2

It seems that in ML we work either with continuous targets, which are handled by regression models, or with categorical targets, which are handled by classification models. There is no separate category for ordinal data.
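A minimal sketch of that distinction, on synthetic data invented for this illustration: the same three labels are given to a classifier as strings and to a regressor as numbers. The ordering healthy < carrier < sick is an assumption made here only so that a numeric coding exists; the question describes the categories as independent.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic features and labels (made up for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
score = X[:, 0]
labels = np.where(score < -0.5, "healthy",
                  np.where(score < 0.5, "carrier", "sick"))

# Classification: the strings are treated as unordered class names.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X[:3]))  # predictions are class names

# Regression: the target must be numeric, so an order has to be assumed.
order = {"healthy": 0, "carrier": 1, "sick": 2}  # hypothetical ordering
y_num = np.array([order[label] for label in labels])
reg = LinearRegression().fit(X, y_num)
print(reg.predict(X[:3]))  # predictions are real numbers, not classes
```

In this sketch it is the choice of estimator, not the dtype of the labels, that decides whether the target is read as a set of classes or as a quantity.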

Masih
  • This is absolutely not true! In ML there are algorithms that handle an ordinal label (OLS implemented in statsmodels, for example), and one can also use multi-label or multi-class models to train a model with an ordinal label (https://towardsdatascience.com/simple-trick-to-train-an-ordinal-regression-with-any-classifier-6911183d2a3c and https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html). – Serendipity Apr 09 '22 at 11:45