How to keep column names after one-hot encoding with sklearn?

Question

I am working on the Titanic Kaggle competition. To deal with categorical data, I split the data into two sets: one for numerical variables and one for categorical variables. After applying sklearn's one-hot encoding to the categorical set, I tried to recombine the two datasets, but since the encoded categorical set is an ndarray and the other one is a DataFrame, I used:

np.hstack((X_train_num, X_train_cat))

which works perfectly but I no longer have the names of my variables.

Is there another way to do this while maintaining the names of the variables without using pd.get_dummies()?

Thanks

score 5 · Accepted Answer · answered May 18 '18 at 15:39

Try

X_train = X_train_num.join(
   pd.DataFrame(X_train_cat, X_train_num.index).add_prefix('cat_')
)
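As a minimal sketch of what this produces, here is the same join on toy data. The frame and array contents are made up for illustration (the column names `Age` and `Fare`, the index values, and the hand-written one-hot array are assumptions, not the asker's data). Passing `X_train_num.index` matters: without it the new DataFrame would get a default 0..n-1 index and the join could misalign rows.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the asker's data: a numeric DataFrame with a
# non-default index, and a dense one-hot ndarray with the same number of rows.
X_train_num = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]},
    index=[10, 11, 12],
)
X_train_cat = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

# Wrap the ndarray in a DataFrame that shares the numeric frame's index,
# prefix its integer column labels, then join on the index.
X_train = X_train_num.join(
    pd.DataFrame(X_train_cat, index=X_train_num.index).add_prefix("cat_")
)

print(list(X_train.columns))
```

The encoded columns come out as `cat_0`, `cat_1`, and so on, so the numeric names are kept but the encoded ones are only positional labels.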

piRSquared

I think this is the better answer. (Well, I think you should use `pd.get_dummies`, but aside from that, this is the better answer.) – Ami Tavory May 18 '18 at 15:44
Thanks @AmiTavory – piRSquared May 18 '18 at 15:45

score 3 · Answer 2 · answered May 18 '18 at 15:42

Well, as you stated in your question, there's pd.get_dummies, which I think is the best choice here. Having said that, you could use

pd.concat([X_train_num, pd.DataFrame(X_train_cat, index=X_train_num.index)], axis=1)

If you like, you could also give the encoded columns useful names with

pd.concat([X_train_num, pd.DataFrame(X_train_cat, index=X_train_num.index, columns=cols)], axis=1)

and cols can be whatever list of strings you want (of the appropriate length).

score 1 · Answer 3 · answered Jul 22 '20 at 03:50

Adding column names when using sklearn's OneHotEncoder:

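import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# object_cols is the list of categorical column names in dev_data / test_data.
# sparse_output=False returns a dense array (scikit-learn >= 1.2;
# older releases call this argument sparse=False).
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(dev_data[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(test_data[object_cols]))

# Adding column names to the encoded data set.
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2.
OH_cols_train.columns = OH_encoder.get_feature_names_out(object_cols)
OH_cols_valid.columns = OH_encoder.get_feature_names_out(object_cols)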

# One-hot encoding removed index; put it back
OH_cols_train.index = dev_data.index
OH_cols_valid.index = test_data.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = dev_data.drop(object_cols, axis=1)
num_X_valid = test_data.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
dev_data = pd.concat([num_X_train, OH_cols_train], axis=1)
test_data = pd.concat([num_X_valid, OH_cols_valid], axis=1)
