What should be the format of one-hot-encoded features for scikit-learn?

Question

I am trying to use the regressors/classifiers of the scikit-learn library. I am a bit confused about the format of one-hot-encoded features, since I can pass either a DataFrame or a NumPy array to the model. Say I have categorical features named 'a', 'b' and 'c'. Should I give them in separate columns (with pandas.get_dummies()), like below:

a	b	c
1	1	1
1	0	1
0	0	1

or merged into a single column, like this:

merged
1,1,1
1,0,1
0,0,1

And how do I tell the scikit-learn model that these are one-hot-encoded categorical features?

DV82XL · Accepted Answer · 2021-09-19T18:02:44.120

You can't pass a feature containing a merged list directly to the model. One-hot encode it into separate columns first, using one of the following options:

If you just want something quick and easy, get_dummies is fine for development, but the following approaches are generally preferred by most sources I've read.
If you want to encode your input data, use OneHotEncoder (OHE) to encode one or more columns, then merge the result with your other features. OHE gives good control over the output format, stores the categories it learned during fit so the same encoding can be applied to new data, and has error handling for unknown categories. Good for production.
If you need to encode a single column, typically but not limited to labels, use LabelBinarizer to one-hot encode a column with a single value per row, or use MultiLabelBinarizer to one-hot encode a column with multiple values per row.
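
LabelBinarizer has no example of its own in this answer, so here is a minimal sketch of the single-value case. The label values are made up for illustration. Note that with only two classes LabelBinarizer returns a single 0/1 column rather than two columns.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels, made up for illustration: one value per row.
y = ['cat', 'dog', 'bird', 'dog']

lb = LabelBinarizer()
y_enc = lb.fit_transform(y)  # one column per class, classes in sorted order

print(lb.classes_)  # the class that each column of y_enc stands for
print(y_enc)

# The fitted binarizer can map encoded rows back to the original labels.
y_back = lb.inverse_transform(y_enc)
```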

Once you have your one-hot encoded data/labels, you don't need to "tell" the model that certain features are one-hot. You just train the model on the data set using clf.fit(X_train, y_train) and make predictions using clf.predict(X_test).
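
As a rough sketch of that fit/predict step, one possible way (not the only one) is to let a ColumnTransformer do the encoding and the merging inside a pipeline. The toy data, the column names 'gender' and 'age', and the choice of LogisticRegression are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, made up for illustration.
X_train = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'age': [23, 35, 41, 52],
})
y_train = [0, 1, 1, 0]

# One-hot encode 'gender' into separate columns; pass 'age' through unchanged.
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['gender'])],
    remainder='passthrough',
)

# The model only ever sees a numeric matrix; nothing marks columns as one-hot.
clf = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# 'Other' was not seen during fit; handle_unknown='ignore' encodes it as all zeros.
X_test = pd.DataFrame({'gender': ['Female', 'Other'], 'age': [30, 45]})
pred = clf.predict(X_test)
X_test_enc = clf[0].transform(X_test)  # 2 one-hot columns + 1 passthrough column
```

Keeping the encoder inside the pipeline means the categories learned from the training data are reused automatically at prediction time.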

OHE example

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X = [['Male', 1], ['Female', 3], ['Female', 2]]
ohe = OneHotEncoder(handle_unknown='ignore')
X_enc = ohe.fit_transform(X).toarray()

# Convert to dataframe if you need to merge this with other features
# (get_feature_names_out() requires scikit-learn >= 1.0):
df = pd.DataFrame(X_enc, columns=ohe.get_feature_names_out())

MLB example

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

df = pd.DataFrame({
   'style': ['Folk', 'Rock', 'Classical'],
   'instruments': [['guitar', 'vocals'], ['guitar', 'bass', 'drums', 'vocals'], ['piano']]
})

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df['instruments'])
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df['instruments'].index)

# Drop old column and merge new encoded columns
df = df.drop('instruments', axis=1)
df = pd.concat([df, encoded_df], axis=1, sort=False)

@MehmedB MLB is used to encode the labels. I added MLB to my answer for more clarity. Let me know if you still have questions. If you want to get more specific, please add a code sample to your question. - DV82XL, Sep 18 '21 at 23:48
So, finally, does that mean I can send a DataFrame in which the one-hot-encoded categorical features are in separate columns (for sklearn models)? Or do I have to send a one-hot-encoded array (each row merged into an array instead of separate columns)? - MehmedB, Sep 19 '21 at 16:55
They need to be one-hot encoded in separate columns and merged with any other features you have (for the input matrix `X`). If you're one-hot encoding the labels (`y`), then you pass the labels encoded in separate columns to the `fit(X, y)` method (no need to merge them into your feature matrix, since labels are passed in a separate argument). - DV82XL, Sep 19 '21 at 17:35
