
I am new to Python. I am working on multi-label classification and need to prepare the target data as a multi-hot encoding. It has taken far more time than I initially expected. This is not real data (I can't post it). Each row is uniquely identified by the pair (ID, Category), so an ID can have several rows, one per category. The real data has 10 categories, but I am mocking 4. There are a bunch of other columns (predictors). Below is the best I could come up with after a couple of hours of trying many approaches:

import pandas as pd

test_data = pd.DataFrame()
test_data["Category"] = ['A', 'B', 'C', 'D', 'A', 'C']
test_data["ID"] = [1, 1, 3, 4, 5, 6]
# One row per ID, one column per category; each cell holds the category label or NaN
test_data = test_data.pivot(index='ID', columns='Category', values='Category').reset_index()
# Replace the missing categories with a dummy '0'
test_data = test_data.fillna('0')
test_data = test_data.reset_index(drop=True).rename_axis(None, axis=1)
data = test_data.drop(['ID'], axis=1)
print(data)

   A  B  C  D
0  A  B  0  0
1  0  0  C  0
2  0  0  0  D
3  A  0  0  0
4  0  0  C  0

As you can see, I am filling the categories that are not present with a dummy '0'.

data = data.astype(str).values 
data

array([['A', 'B', '0', '0'],
       ['0', '0', 'C', '0'],
       ['0', '0', '0', 'D'],
       ['A', '0', '0', '0'],
       ['0', '0', 'C', '0']], dtype=object)

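from sklearn.preprocessing import MultiLabelBinarizer

cat = ['A', 'B', 'C', 'D', '0']
# classes is passed by keyword; recent scikit-learn versions reject it as a positional argument
mlb = MultiLabelBinarizer(classes=cat)
mlb.fit_transform(data)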

array([[1, 1, 0, 0, 1],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1],
       [1, 0, 0, 0, 1],
       [0, 0, 1, 0, 1]])
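The last column is the dummy '0', which is set in every row. One possible way to avoid it, as a rough sketch only: leave '0' out of `classes`. As far as I can tell, `MultiLabelBinarizer` then skips labels it does not know and emits a `UserWarning`, so only the four real categories get a column. The array is retyped from the mock data, and silencing the warning is my own choice here.

```python
import warnings

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Mock label matrix, with '0' standing in for "category not present"
data = np.array([
    ['A', 'B', '0', '0'],
    ['0', '0', 'C', '0'],
    ['0', '0', '0', 'D'],
    ['A', '0', '0', '0'],
    ['0', '0', 'C', '0'],
], dtype=object)

# Only the real categories are declared; '0' is deliberately left out
mlb = MultiLabelBinarizer(classes=['A', 'B', 'C', 'D'])

with warnings.catch_warnings():
    # Unknown labels ('0') trigger a UserWarning and are ignored
    warnings.simplefilter("ignore", UserWarning)
    encoded = mlb.fit_transform(data)

print(encoded)
```

The columns of `encoded` follow the order given in `classes`, so column 0 is 'A' and column 3 is 'D'.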

There are two things I am looking for help with:

  1. How do I get rid of the dummy category ('0') in the encoding?

  2. It feels like I am hacking my way through this. Is there a better way of doing it?
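For what it's worth, here is a sketch of a shorter route that might cover both points, assuming the long-format table only needs its ID and Category columns for the target: collect the categories of each ID into a list and pass those lists to `MultiLabelBinarizer`, so no pivot and no dummy value are involved. The names `df`, `labels_per_id`, and `target` are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same mock data, in long format: one row per (ID, Category) pair
df = pd.DataFrame({
    "ID": [1, 1, 3, 4, 5, 6],
    "Category": ['A', 'B', 'C', 'D', 'A', 'C'],
})

# One list of categories per ID, e.g. ID 1 -> ['A', 'B']
labels_per_id = df.groupby("ID")["Category"].apply(list)

# Fixing the class list keeps the column order stable even if a
# category happens to be missing from a given sample of the data
mlb = MultiLabelBinarizer(classes=['A', 'B', 'C', 'D'])
multi_hot = mlb.fit_transform(labels_per_id)

# Optional: wrap the result so rows are labelled by ID and columns by category
target = pd.DataFrame(multi_hot, index=labels_per_id.index, columns=mlb.classes_)
print(target)
```

With pandas alone, `pd.crosstab(df["ID"], df["Category"]).clip(upper=1)` might give a similar table, although it only creates columns for categories that actually occur in the data.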

For the curious: I am using a fully connected neural network for this classification.

Thanks for your help.

user8716498