Suppose that we have this data frame:

ID CATEGORIES
0 ['A']
1 ['A', 'C']
2 ['B', 'C']

I want to one-hot encode the CATEGORIES column. The result I want is:

ID A B C
0 1 0 0
1 1 0 1
2 0 1 1

I know this is easy to code by hand. I just want to know whether it is already implemented in some package, since a pure-Python implementation would probably be quite slow.

(I had to put the tables in code blocks because Stack Overflow would not let me post them as tables.)

Marcin Orlowski
sbb
    Rather than suggesting a library such as [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), try using it and if you do not get the desired output edit the question. – Giuseppe La Gualano Nov 14 '22 at 15:08
    Please refer to the following: https://stackoverflow.com/questions/74015943/split-pandas-column-of-lists-into-multiple-columns-based-on-value/74016030#74016030 – R. Baraiya Nov 14 '22 at 15:13

1 Answer

You can use str.join combined with str.get_dummies:

out = df[['ID']].join(df['CATEGORIES'].str.join('|').str.get_dummies())

Output:

   ID  A  B  C
0   0  1  0  0
1   1  1  0  1
2   2  0  1  1

Used input:

import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})
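Putting the input and the one-liner together, here is a minimal self-contained sketch. It assumes every entry in CATEGORIES is a list of strings and that no label contains the `|` character, since `str.get_dummies` splits on that separator (it is the default, spelled out here for clarity).

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# Join each list into one string ('A|C'), then let pandas split it
# back into one indicator column per distinct label.
dummies = df['CATEGORIES'].str.join('|').str.get_dummies(sep='|')

# Attach the indicator columns to the ID column (aligned on the index).
out = df[['ID']].join(dummies)
print(out)
```

If the labels could contain `|`, a different separator that never occurs in the data would have to be passed to both `str.join` and `str.get_dummies`.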

There are many other alternatives, using pivot, crosstab, etc.

One example:

df2 = df.explode('CATEGORIES')

out = pd.crosstab(df2['ID'], df2['CATEGORIES']).reset_index()
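For reference, a runnable version of the `explode` + `crosstab` variant might look like the sketch below. The `clip(upper=1)` step and the removal of the columns' name are additions for illustration, not part of the snippet above: `crosstab` counts occurrences, so a label repeated inside one list would otherwise produce a value greater than 1. Rows whose list is empty would likely be dropped by this approach, so treat it as a sketch under the assumption that every list is non-empty.

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 1, 2],
                   'CATEGORIES': [['A'], ['A', 'C'], ['B', 'C']]})

# One row per (ID, category) pair.
df2 = df.explode('CATEGORIES')

# Count each pair, then cap at 1 so repeated labels still give 0/1 flags.
out = (pd.crosstab(df2['ID'], df2['CATEGORIES'])
         .clip(upper=1)
         .reset_index())

# Drop the leftover 'CATEGORIES' label on the column axis.
out.columns.name = None
print(out)
```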
mozway