You can't pass a feature whose values are lists directly to the model. You should one-hot encode it into separate columns first:
- If you just want something quick and easy, pandas' get_dummies is fine for development, but most sources I've read prefer the following approaches.
- If you want to encode your input data, use OneHotEncoder (OHE) to encode one or more columns, then merge the result with your other features. OHE gives good control over the output format, stores the fitted categories so the same encoding can be reapplied to new data, and has error handling for unknown values. Good for production.
- If you need to encode a single column, typically but not limited to labels, use LabelBinarizer to one-hot encode a column with a single value per row, or use MultiLabelBinarizer to one-hot encode a column with multiple values per row.
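The LabelBinarizer case isn't shown in the examples further down, so here is a minimal sketch of it. The `styles` list is made-up illustration data, not something from the question.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical single-valued column: exactly one style per sample
styles = ['Folk', 'Rock', 'Classical', 'Rock']

lb = LabelBinarizer()
encoded = lb.fit_transform(styles)  # one column per class, one row per sample

print(lb.classes_)  # the classes, in sorted order
print(encoded)

# inverse_transform maps the one-hot rows back to the original values
decoded = lb.inverse_transform(encoded)
```

One thing to watch for: with only two classes, LabelBinarizer returns a single 0/1 column rather than two columns.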
Once you have your one-hot encoded data/labels, you don't need to "tell" the model that certain features are one-hot. You just train the model on the data set using clf.fit(X_train, y_train) and make predictions using clf.predict(X_test).
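As a rough end-to-end sketch of that workflow: the toy data, the column names and the choice of DecisionTreeClassifier are all assumptions for illustration, and any scikit-learn estimator could stand in for `clf`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two categorical features and a binary target
X_train = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male'],
                        'group': ['a', 'b', 'a', 'b']})
y_train = [0, 1, 1, 0]
X_test = pd.DataFrame({'gender': ['Female'], 'group': ['b']})

ohe = OneHotEncoder(handle_unknown='ignore')
X_train_enc = ohe.fit_transform(X_train)  # learn the categories from training data only
X_test_enc = ohe.transform(X_test)        # reuse the same categories for the test data

# The model just sees numeric columns; nothing marks them as one-hot
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train_enc, y_train)
pred = clf.predict(X_test_enc)
print(pred)
```

Fitting the encoder on the training data and only calling transform on the test data keeps the column layout identical between the two.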
OHE example
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
X = [['Male', 1], ['Female', 3], ['Female', 2]]
ohe = OneHotEncoder(handle_unknown='ignore')
X_enc = ohe.fit_transform(X).toarray()
# Convert to dataframe if you need to merge this with other features
# (get_feature_names_out() requires scikit-learn >= 1.0):
df = pd.DataFrame(X_enc, columns=ohe.get_feature_names_out())
MLB example
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd
df = pd.DataFrame({
    'style': ['Folk', 'Rock', 'Classical'],
    'instruments': [['guitar', 'vocals'], ['guitar', 'bass', 'drums', 'vocals'], ['piano']]
})
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(df['instruments'])
encoded_df = pd.DataFrame(encoded, columns=mlb.classes_, index=df['instruments'].index)
# Drop old column and merge new encoded columns
df = df.drop('instruments', axis=1)
df = pd.concat([df, encoded_df], axis=1, sort=False)