6

Here is my question; I hope someone can help me figure it out.

To explain: my data set has more than 10 categorical columns, each with 200-300 categories. I want to convert them into binary values. To do that, I first used a label encoder to convert the string categories into numbers. The Label Encoder code and its output are shown below.

[Image: Label Encoder code and output]

After the Label Encoder, I used the One Hot Encoder from scikit-learn, and it worked. The problem is that I need column names after one hot encoding. For example, take column A with categorical values before encoding: A = [1,2,3,4,..]

After encoding, the columns should be named like this:

A-1, A-2, A-3

Does anyone know how to name the columns (old column name - value name or number) after one hot encoding? Here is my one hot encoding and its output:

[Image: One Hot Encoder code and output]

I need named columns because I trained an ANN, but every time new data comes in I cannot convert all the past data again and again. So I want to add just the new values each time. Thanks anyway.

dss
  • 3
    Instead of scikit transformers, use Dataframe.get_dummies(), which will automatically assign appropriate column names to them – Vivek Kumar Jul 13 '17 at 12:50
  • 1
    This may not be appropriate if you want to create an API or something where you would want to serialize the label and one hot encoder in order to be able to convert input data quickly into readable data by the model. – Tibor Udvari Feb 12 '18 at 13:29
  • When I used DataFrame.get_dummies, I got an error that states `AttributeError: 'DataFrame' object has no attribute 'get_dummies'` – Taylrl Dec 17 '18 at 14:58
  • the question would be better with code as _text_ not as image. – Jean-François Fabre Apr 12 '19 at 18:51

2 Answers

0

As @Vivek Kumar mentioned, you can use the pandas function get_dummies() instead of OneHotEncoder. I wanted to preserve a version of my initial DataFrame, so I did the following:

import pandas as pd
DataFrame2 = pd.get_dummies(DataFrame)
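
To get the exact "A-1, A-2, A-3" naming from the question, something like the sketch below might work. The toy DataFrame and the names df, encoded and new_batch are made up for illustration; prefix_sep, columns and dtype are regular pd.get_dummies parameters. The reindex step is one possible way to encode only newly arrived rows against the column set seen earlier, assuming that column list has been saved somewhere.

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's categorical columns
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": ["x", "y", "x", "z"]})

# columns= forces encoding of the listed columns even when they are numeric;
# prefix_sep="-" gives names like "A-1" instead of the default "A_1"
encoded = pd.get_dummies(df, columns=["A", "B"], prefix_sep="-", dtype=int)
print(list(encoded.columns))

# Encode only the new rows, then align them to the columns seen before;
# categories missing from the new batch are filled with 0
new_batch = pd.DataFrame({"A": [2], "B": ["z"]})
new_encoded = pd.get_dummies(
    new_batch, columns=["A", "B"], prefix_sep="-", dtype=int
).reindex(columns=encoded.columns, fill_value=0)
print(new_encoded)
```

Note that reindex silently drops any category that was not in the original column list, so a brand-new category in new data would be encoded as all zeros.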
Taylrl
  • He knows and this is not what he's asking. You can't use get dummies when you are working with complex Pipelines and ColumnTransformers. – Odisseo Mar 24 '19 at 04:30
  • I must be missing something. Where is the need for a complex Pipeline? I am suggesting using this instead of the ColumnTransformer. – Taylrl Apr 17 '19 at 15:45
  • Well although you could use get dummies, applying all transformations within one single pipeline greatly increases model portability and reproducibility. Let’s say I’m going to recreate the model in a different environment. I can either redo get dummies and make sure it transforms the data the same way, or simply apply the same exact pipeline... – Odisseo Apr 17 '19 at 16:02
0

I used the following code to rename each one-hot encoded column to "original name_one-hot encoded name". So for your example it would give A_1, A_2, A_3. Feel free to change the "_" below to "-".

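import pandas as pd

# Create list of columns with "object" dtype
# (np.object was removed in NumPy 1.24; the builtin object works everywhere)
cat_cols = [col for col in df_pro.columns if df_pro[col].dtype == object]

# ohenc is the OneHotEncoder already fitted on df_pro[cat_cols];
# categories_ holds one array of category values per encoded column
cat_labels = ohenc.categories_

# Pair each column with its own categories, in the encoder's output order.
# This also stays correct when the same value appears in several columns.
cat_labels_new = [f"{col}_{label}" for col, labels in zip(cat_cols, cat_labels) for label in labels]

# Create new DataFrame of transformed columns using the new labels.
# cat_arr is the dense encoder output (call .toarray() on it if it is sparse).
cat_ohc = pd.DataFrame(cat_arr, columns=cat_labels_new, index=df_pro.index)

# Concat with original DataFrame and drop original columns (only columns with "object" dtype)
# (this step was not shown in the original answer; this is one way to do it)
df_pro = pd.concat([df_pro.drop(columns=cat_cols), cat_ohc], axis=1)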
Ardent Coder
Timmy