4

I have the following code to one-hot-encode 2 columns I have.

# encode city labels using one-hot encoding scheme
city_ohe = OneHotEncoder(categories='auto')
city_feature_arr = city_ohe.fit_transform(df[['city']]).toarray()
city_feature_labels = city_ohe.categories_
city_features = pd.DataFrame(city_feature_arr, columns=city_feature_labels)

phone_ohe = OneHotEncoder(categories='auto')
phone_feature_arr = phone_ohe.fit_transform(df[['phone']]).toarray()
phone_feature_labels = phone_ohe.categories_
phone_features = pd.DataFrame(phone_feature_arr, columns=phone_feature_labels)

What I'm wondering is how to do this in 4 lines while still getting properly named columns in the output. I can create a properly one-hot-encoded array by including both column names in fit_transform, but when I try to name the resulting dataframe's columns, I get an error about a mismatch between the shape of the values and the indices:

ValueError: Shape of passed values is (6, 50000), indices imply (3, 50000)

For background, both phone and city have 3 values.

    city   phone
0   CityA  iPhone
1   CityB  Android
2   CityB  iPhone
3   CityA  iPhone
4   CityC  Android
Python Developer

4 Answers

12

You are almost there... As you said, you can pass all the columns you want to encode to fit_transform directly.

ohe = OneHotEncoder(categories='auto')
feature_arr = ohe.fit_transform(df[['phone','city']]).toarray()
feature_labels = ohe.categories_

And then you just need to flatten the per-column label arrays (this works here because both columns have the same number of categories):

feature_labels = np.array(feature_labels).ravel()

Which enables you to name your columns like you wanted:

features = pd.DataFrame(feature_arr, columns=feature_labels)
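
If the encoded columns do not all have the same number of categories, `np.array(...).ravel()` no longer yields one flat list of labels. A minimal sketch of one way around that, assuming the sample data from the question (where phone happens to have only two distinct values) and using `np.concatenate` in place of `ravel()`:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample rows from the question; phone has 2 distinct values, city has 3
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

ohe = OneHotEncoder(categories="auto")
feature_arr = ohe.fit_transform(df[["phone", "city"]]).toarray()

# categories_ is a list holding one array per input column;
# concatenating joins them into a single flat array even when lengths differ
feature_labels = np.concatenate(ohe.categories_)

features = pd.DataFrame(feature_arr, columns=feature_labels)
```

The labels come out in the same order as the encoded columns (all phone categories first, then all city categories), so they line up with `feature_arr`.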
MaximeKan
  • 1
    thanks! I should have found that ravel() function. it's super helpful. – Python Developer Mar 19 '19 at 01:28
  • 2
    @MaximeKan I'm having problems when creating the dataframe with more than one feature: it returns the error Shape of passed values is (10692, 7), indices imply (10692, 2). I'm having to write the feature labels manually. How do I solve this? – Vitor Gonçalves Jun 25 '20 at 19:11
  • 2
    @VitorGonçalves This happens because the returning dataset from `fit_transform` has 7 columns after transform and therefore Pandas expects 7 corresponding labels in the `feature_labels` array to match with the dataset, but it only has 2 elements. To fix this error, replace `feature_labels = ohe.categories_` with `feature_labels = ohe.get_feature_names()` – Pavindu Jun 16 '21 at 06:36
1

Why don't you take a look at pd.get_dummies? Here's how you can encode:

df['city'] = df['city'].astype('category')
df['phone'] = df['phone'].astype('category')
df = pd.get_dummies(df)
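
As a rough illustration of what this produces, here is a small sketch on the sample rows from the question. It passes the column names through the `columns` argument instead of casting to category first, which is an alternative way to select the columns to encode, not the exact code above:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "city": ["CityA", "CityB", "CityB", "CityA", "CityC"],
    "phone": ["iPhone", "Android", "iPhone", "iPhone", "Android"],
})

# Each listed column is replaced by one indicator column per distinct value,
# named "<column>_<value>"
encoded = pd.get_dummies(df, columns=["city", "phone"])
print(encoded.columns.tolist())
```

The resulting columns are prefixed with the original column name, so there is no separate labeling step.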
panktijk
1

This solution gives the same column names as pd.get_dummies(), which is useful IMO.

labels = ['Sex', 'Embarked', 'Pclass']

categorical_data = data[labels]

ohe = OneHotEncoder(categories='auto')

feature_arr = ohe.fit_transform(categorical_data).toarray()

ohe_labels = ohe.get_feature_names(labels)

features = pd.DataFrame(
               feature_arr,
               columns=ohe_labels)
  • DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. – Joe Ferndz Aug 04 '23 at 07:47
0
cat_features = [
    "gender", "cholesterol", "gluc", "smoke", "alco"
]

data = pd.get_dummies(data, columns = cat_features)
naimur978