I have a dataset stored in a NumPy array of shape (10000, 15). Columns [1, 5, 6, 7, 8, 9, 13, 14] are categorical, while the rest are numerical. I need to convert the categorical columns to numerical values so that I can use the data in models (using sklearn).
I have attempted to use OneHotEncoder from sklearn.preprocessing:
from sklearn.preprocessing import OneHotEncoder

# train = subset of data used for training, shape (7000, 15)
cat_columns = [1, 5, 6, 7, 8, 9, 13, 14]
ohe = OneHotEncoder(sparse_output=False)
train_encoded = ohe.fit_transform(train[:, cat_columns])
train[:, cat_columns] = train_encoded  # fails: train_encoded has more than 8 columns
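Unfortunately, this doesn't work because the data changes shape after being encoded: each categorical column expands into one column per category, so the result no longer fits back into the original eight columns. I would appreciate any suggestions on how to turn this categorical data into numerical data.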
Here are the first 3 rows of the data. The last feature is the target that will be split off and predicted later on.
[['39' 'State-gov' '77516' 'Bachelors' '13' 'Never-married'
'Adm-clerical' 'Not-in-family' 'White' 'Male' '2174' '0' '40'
'United-States' '<=50K']
['50' 'Self-emp-not-inc' '83311' 'Bachelors' '13' 'Married-civ-spouse'
'Exec-managerial' 'Husband' 'White' 'Male' '0' '0' '13' 'United-States'
'<=50K']
['38' 'Private' '215646' 'HS-grad' '9' 'Divorced' 'Handlers-cleaners'
'Not-in-family' 'White' 'Male' '0' '0' '40' 'United-States' '<=50K']
]