I have a pandas dataframe that looks like this:

    | col1              | col2
-------------------------------
0   | ['a', 'b', 'c']   |   a
1   | ['b', 'd', 'e']   |   g

and I want to know the most efficient way to get this dataframe:

    | col1              | col2 | a | b | c | d | e
---------------------------------------------------
0   | ['a', 'b', 'c']   |   a  | 1 | 1 | 1 | 0 | 0 
1   | ['b', 'd', 'e']   |   g  | 0 | 1 | 0 | 1 | 1

I tried the "apply" method, but it does not seem to be efficient for a dataframe of shape [40000, 100] (where col1 draws from a set of about 1k unique values).

Here is my code:

import pandas as pd
import scipy.sparse

df = pd.DataFrame({'col1': [['a', 'b', 'c'], ['b', 'd', 'e']], 'col2': [2, 5]})
# Sorted list of unique values, so the column order is deterministic
s = sorted({item for sublist in df['col1'] for item in sublist})
# One 0/1 indicator row per list in col1
res = scipy.sparse.csr_matrix(df['col1'].apply(
                lambda row: [1 if i in row else 0 for i in s]).tolist())

Then res.toarray() gives me

array([[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]], dtype=int64)

Does anyone have a more efficient way of doing this?

In advance, thanks a lot!
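
For reference, one possible direction is to skip "apply" and build the sparse matrix directly from (row, column) index pairs. This is only a minimal sketch, not a benchmarked solution: the names labels, col_index, rows, cols, indicator and out are made up for illustration, and it assumes every entry of col1 is a list of hashable values.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'col1': [['a', 'b', 'c'], ['b', 'd', 'e']], 'col2': [2, 5]})

# Sorted vocabulary so the column order is deterministic.
labels = sorted({item for sublist in df['col1'] for item in sublist})
col_index = {label: j for j, label in enumerate(labels)}

# Each list contributes len(list) entries to its own row.
lengths = df['col1'].map(len).to_numpy()
rows = np.repeat(np.arange(len(df)), lengths)

# Column position of every item, flattened in the same order as rows.
cols = np.fromiter(
    (col_index[item] for sublist in df['col1'] for item in sublist),
    dtype=np.int64,
    count=int(lengths.sum()),
)

data = np.ones(len(rows), dtype=np.int64)
res = sparse.csr_matrix((data, (rows, cols)), shape=(len(df), len(labels)))

# Repeated items within one list are summed by the constructor; clip to 0/1.
res.data[:] = 1

# Optional: attach the indicators as sparse columns next to the original data.
indicator = pd.DataFrame.sparse.from_spmatrix(res, index=df.index, columns=labels)
out = df.join(indicator)
```

The idea is that the work per row is proportional to the length of its list rather than to the number of unique values, which is where the "apply" version spends most of its time.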

Nickil Maveli
dark
