I am new to Python. I am working on multi-label classification and need to prepare the target data as a multi-hot encoding, which has taken far more time than I expected. This is not real data (I can't post it). Each row is uniquely identified by the pair (ID, Category), so an ID can have multiple rows, one per category. The real data has 10 categories, but I am mocking 4 here. There are also a bunch of other columns (predictors). Below is the best I came up with after a couple of hours of trying many approaches:
import pandas as pd

test_data = pd.DataFrame()
test_data["Category"] = ['A','B','C','D','A','C']
test_data["ID"] = [1,1,3,4,5,6]
test_data = test_data.pivot(index='ID', columns='Category',
                            values='Category').reset_index()
test_data = test_data.fillna('0')
test_data = test_data.reset_index(drop=True).rename_axis(None, axis=1)
data = test_data.drop(['ID'], axis=1)
print(data)
   A  B  C  D
0  A  B  0  0
1  0  0  C  0
2  0  0  0  D
3  A  0  0  0
4  0  0  C  0
As you can see, I am filling the categories that are not present with a dummy '0'.
data = data.astype(str).values
data
array([['A', 'B', '0', '0'],
       ['0', '0', 'C', '0'],
       ['0', '0', '0', 'D'],
       ['A', '0', '0', '0'],
       ['0', '0', 'C', '0']], dtype=object)
from sklearn.preprocessing import MultiLabelBinarizer
cat = ['A', 'B', 'C', 'D', '0']
# `classes` is keyword-only in recent scikit-learn releases
mlb = MultiLabelBinarizer(classes=cat)
mlb.fit_transform(data)
array([[1, 1, 0, 0, 1],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1],
       [1, 0, 0, 0, 1],
       [0, 0, 1, 0, 1]])
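The last column above is the dummy '0' class. One direction that might avoid it, sketched here under the assumption that the categories of each ID can be collected into a list before binarizing (so no pivot and no placeholder are involved). The names `labels_per_id` and `multi_hot` are illustrative only, and this is a rough sketch on the mock data rather than a confirmed best practice:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same mock data as above: one row per (ID, Category) pair.
test_data = pd.DataFrame({
    "Category": ["A", "B", "C", "D", "A", "C"],
    "ID": [1, 1, 3, 4, 5, 6],
})

# Collect the categories of each ID into a list, e.g. ID 1 -> ['A', 'B'].
# groupby sorts the IDs, so the rows come out in the order 1, 3, 4, 5, 6.
labels_per_id = test_data.groupby("ID")["Category"].apply(list)

# Fix the column order explicitly; no dummy '0' class is listed.
mlb = MultiLabelBinarizer(classes=["A", "B", "C", "D"])
multi_hot = mlb.fit_transform(labels_per_id)

print(labels_per_id.index.tolist())
print(multi_hot)
```

Each row of `multi_hot` lines up with one entry of `labels_per_id.index`, which may be handy for joining the targets back to the predictor columns.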
There are two things I am looking for help with:
How do I get rid of the encoding for my dummy category ('0')?
It feels like I am hacking my way through this. Is there a better way of doing it?
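As a point of comparison for the second question, a pandas-only route might look roughly like the sketch below. It assumes that a 0/1 table indexed by ID is an acceptable target format. The `categories` list and the `multi_hot` name are illustrative, and the `clip` step is only a guard in case an (ID, Category) pair is ever duplicated:

```python
import pandas as pd

# Same mock data as in the question: one row per (ID, Category) pair.
test_data = pd.DataFrame({
    "Category": ["A", "B", "C", "D", "A", "C"],
    "ID": [1, 1, 3, 4, 5, 6],
})

# Count occurrences of each (ID, Category) pair; absent pairs become 0.
counts = pd.crosstab(test_data["ID"], test_data["Category"])

# Cap counts at 1 and force a fixed column order, so a category that is
# missing from the data would still get an all-zero column.
categories = ["A", "B", "C", "D"]
multi_hot = counts.clip(upper=1).reindex(columns=categories, fill_value=0)

print(multi_hot.to_numpy())
```

Calling `multi_hot.to_numpy()` gives a plain integer array, and `multi_hot.index` keeps the IDs available for aligning with the predictors.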
For the curious, I am using a fully connected neural network for this classification.
Thanks for your help.