How to apply onehot encoder over vectorized dataframe columns?

Question

Suppose that we have this data frame:

ID	CATEGORIES
0	['A']
1	['A', 'C']
2	['B', 'C']

I want to one-hot encode the CATEGORIES column. The result I want is:

ID	A	B	C
0	1	0	0
1	1	0	1
2	0	1	1

I know this is easy to code by hand. I just want to know whether it is already implemented in some package, since a pure-Python implementation would probably be quite slow.

(I had to put the tables in code fields because Stack Overflow would not let me post them as tables.)

Rather than suggesting a library such as [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), try using it and if you do not get the desired output edit the question. — Giuseppe La Gualano, Nov 14 '22 at 15:08
Please refer to the following: https://stackoverflow.com/questions/74015943/split-pandas-column-of-lists-into-multiple-columns-based-on-value/74016030#74016030 — R. Baraiya, Nov 14 '22 at 15:13

score 1 · Accepted Answer · answered Nov 14 '22 at 15:06

You can use str.join combined with str.get_dummies:

out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())

Output:

   ID  A  B  C
0   0  1  0  0
1   1  1  0  1
2   2  0  1  1

Input used:

import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})
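To make the one-liner easier to follow, here is a minimal sketch that splits it into its intermediate steps. The variable names `joined` and `dummies` are only illustrative. It assumes that no category label contains the `|` character, since that is the delimiter used to join and split the labels.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Step 1: collapse each list into one delimited string, e.g. ['A', 'C'] -> 'A|C'.
# Assumption: no label contains '|'; otherwise pick another separator.
joined = df['CATEGORIES'].str.join('|')

# Step 2: split on the same delimiter and build one 0/1 column per distinct label.
dummies = joined.str.get_dummies(sep='|')

# Step 3: put the ID column back next to the indicator columns.
out = df[['ID']].join(dummies)
print(out)
```

If the labels might contain `|`, the same separator can be changed in both calls, as long as it never occurs inside a label.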

There are many other alternatives, using pivot, crosstab, etc.

One example:

df2 = df.explode('CATEGORIES')

out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
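One thing to keep in mind with the crosstab variant is that it counts occurrences, so a label repeated within one list would give a value above 1. The sketch below is one possible way to handle that, not part of the original answer: the duplicated 'C' in the last row is a hypothetical input added only to show the effect, and the names `counts` and `onehot` are illustrative. Resetting the index after explode is an optional precaution, because explode repeats the original index labels.

```python
import pandas as pd

# Hypothetical input: the last row repeats 'C' to show how counts behave.
df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C', 'C']]})

# One row per (ID, label) pair; drop the repeated index labels.
df2 = df.explode('CATEGORIES').reset_index(drop=True)

# crosstab counts how often each label occurs for each ID.
counts = pd.crosstab(df2['ID'], df2['CATEGORIES'])

# Cap the counts at 1 to get 0/1 indicators, then restore ID as a column.
onehot = counts.clip(upper=1).reset_index()
onehot.columns.name = None  # drop the leftover 'CATEGORIES' axis label
print(onehot)
```

Rows whose list is empty turn into a missing value after explode and are dropped by crosstab, so such IDs would need to be added back separately if they matter.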
