
I have a dataset stored in a (10000, 15) numpy array. Columns [1, 5, 6, 7, 8, 9, 13, 14] are all categorical, while the rest are numerical data. I need to change the categorical data to numerical to be able to use it in the models (using sklearn).

I have attempted to use OneHotEncoder from the sklearn.preprocessing library,

# train = subset of data used for training, shape (7000, 15)
cat_columns = [1, 5, 6, 7, 8, 9, 13, 14]
ohe = OneHotEncoder(sparse_output=False)
train_encoded = ohe.fit_transform(train[:, cat_columns])
train[:, cat_columns] = train_encoded

Unfortunately, this doesn't work because the data changes shape after being encoded. I would appreciate any suggestions on how to turn this categorical data into numerical.

Here is an example of the first 3 rows of data; the last feature is what will be split off and predicted later on.

[['39' 'State-gov' '77516' 'Bachelors' '13' 'Never-married'
  'Adm-clerical' 'Not-in-family' 'White' 'Male' '2174' '0' '40'
  'United-States' '<=50K']
 ['50' 'Self-emp-not-inc' '83311' 'Bachelors' '13' 'Married-civ-spouse'
  'Exec-managerial' 'Husband' 'White' 'Male' '0' '0' '13' 'United-States'
  '<=50K']
 ['38' 'Private' '215646' 'HS-grad' '9' 'Divorced' 'Handlers-cleaners'
  'Not-in-family' 'White' 'Male' '0' '0' '40' 'United-States' '<=50K']
]
vito
    Can you explain "the data changes shape after being encoded"? – Dr. Snoopy Mar 22 '23 at 08:29
  • I do not understand what you want to one-hot-encode. Do you have 15 different categorical variables? – Salvatore Daniele Bianco Mar 22 '23 at 08:49
  • @SalvatoreDanieleBianco I am looking to encode the columns that contain categorical features. For example, in the sample data I gave, column 3 is a category of education level, so I would like to encode it to be able to use when training and fitting various ML classifiers. – vito Mar 22 '23 at 18:01
  • @Dr.Snoopy I am unable to just reassign the columns I've selected to the encoded columns because the encoded columns have a different shape from the original. – vito Mar 22 '23 at 18:02
  • You need to decide which numerical encoding algorithm you need. When your categories are ordered you can use `OrdinalEncoder`, when they are not, `OneHotEncoder` is recommended to avoid bias in your prediction. More information [here](https://stackoverflow.com/questions/69052776/ordinal-encoding-or-one-hot-encoding). – Mattravel Mar 23 '23 at 01:03
  • Since your categorical data doesn't seem ordered, I would recommend the use of `OneHotEncoder` over `OrdinalEncoder` (see my answer below). – Mattravel Mar 23 '23 at 01:07

3 Answers


I don't think you understand OneHotEncoder correctly. The source linked here will answer your question far better than I ever could.

OneHotEncoder turns your categorical values into binary indicators. A single bit is yes/no, true/false, 1/0, whatever; but always just two possibilities.

If your categorical values are male/female, you are good to go turning them into ones and zeros. If your categorical values are blue/red/green, a single one/zero is not enough.

That's why the OneHotEncoder turns every single value into an array! If you fit red/blue/green to the OneHotEncoder and then transform the value "blue", its representation is an array with a 1 in the "blue" position and 0 everywhere else. Note that sklearn sorts the fitted categories alphabetically (blue, green, red), so "blue" becomes [1, 0, 0].

I think what you are looking for is the OrdinalEncoder. That one does not change the shape because it just replaces your "blue" with a single number rather than an indicator array.

Anyway, you'll find the OrdinalEncoder at the linked source as well. It should give far more insight than my explanation. Hope it helps!

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = [['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married',
  'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40',
  'United-States', '<=50K'],
 ['50', 'Self-emp-not-inc', '83311', 'Bachelors', '13', 'Married-civ-spouse',
  'Exec-managerial', 'Husband', 'White', 'Male', '0', '0', '13', 'United-States',
  '<=50K'],
 ['38', 'Private', '215646', 'HS-grad', '9', 'Divorced', 'Handlers-cleaners',
  'Not-in-family', 'White', 'Male', '0', '0', '40', 'United-States', '<=50K']
]
df = pd.DataFrame(data)

cat_columns = [1, 5, 6, 7, 8, 9, 13, 14]
print(df[cat_columns].shape)  # (3, 8)

# OrdinalEncoder has no sparse_output argument; it always returns a dense array
oe = OrdinalEncoder()
train_encoded = oe.fit_transform(df[cat_columns])
print(train_encoded.shape)  # (3, 8)
df[cat_columns] = train_encoded
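One detail worth knowing: OrdinalEncoder assigns integers in sorted category order, and the fitted encoder can map the numbers back to the original labels via `inverse_transform`. A small self-contained sketch:

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
# categories are sorted, so 'Bachelors' -> 0.0 and 'HS-grad' -> 1.0
codes = enc.fit_transform([['HS-grad'], ['Bachelors'], ['HS-grad']])
print(codes.ravel().tolist())  # [1.0, 0.0, 1.0]

# recover the original labels from the codes
print(enc.inverse_transform(codes).ravel().tolist())  # ['HS-grad', 'Bachelors', 'HS-grad']
```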
Tarquinius

I understand that you want to use OneHotEncoder, but only on part of the array, not on all columns.
You can use np.c_ to add train_encoded to the non-encoded columns:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# train = subset of data used for training, shape (7000, 15)
cat_columns = [1, 5, 6, 7, 8, 9, 13, 14]
ohe = OneHotEncoder(sparse_output=False)

# Encoded columns
train_encoded = ohe.fit_transform(train[:, cat_columns])

# Non encoded columns
train_not_encoded = np.delete(train, cat_columns, 1)

# Merging the two
train_2 = np.c_[train_not_encoded, train_encoded]

Even easier is using pd.get_dummies, which allows you to specify which columns to encode:

import pandas as pd
df = pd.DataFrame(train)
train_2 = pd.get_dummies(df, columns=cat_columns)
Mattravel

I would suggest using the TargetEncoder from the category_encoders package.

from category_encoders import TargetEncoder

cat_columns = [1, 5, 6, 7, 8, 9, 13, 14]

# X: exogenous variables (pandas DataFrame or array); y: endogenous variable (pandas Series or array)
enc = TargetEncoder(cols=cat_columns).fit(X, y)

numeric_train = enc.transform(X)

This solution does not change the shape of the original data, and it is better than plain ordinal encoding (replacing categories with 0, 1, 2... with no criterion at all) because it takes the relationship between the categorical exogenous variables and the endogenous one into account. The linked page has a more detailed explanation.
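The idea can be illustrated without the package: in its simplest form, target encoding replaces each category with the mean of the target within that category (a simplified sketch with made-up data; the real TargetEncoder also smooths these means toward the global mean to avoid overfitting rare categories).

```python
import pandas as pd

# hypothetical toy data: target-encode 'workclass' against a binary income target
df = pd.DataFrame({'workclass': ['State-gov', 'Private', 'Private', 'State-gov'],
                   'income':    [1, 0, 0, 1]})

# the (unsmoothed) target encoding is the per-category mean of the target
means = df.groupby('workclass')['income'].mean()
df['workclass_enc'] = df['workclass'].map(means)
print(df['workclass_enc'].tolist())  # [1.0, 0.0, 0.0, 1.0]
```

The encoded column has the same shape as the original one, which is why this approach preserves the (10000, 15) layout.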

Hope this helps!