I have a pandas DataFrame that looks like this:
  | col1            | col2
--------------------------
0 | ['a', 'b', 'c'] | a
1 | ['b', 'd', 'e'] | g
I would like to know the most efficient way to turn it into this DataFrame, with one 0/1 indicator column per unique item in col1:
  | col1            | col2 | a | b | c | d | e
-----------------------------------------------
0 | ['a', 'b', 'c'] | a    | 1 | 1 | 1 | 0 | 0
1 | ['b', 'd', 'e'] | g    | 0 | 1 | 0 | 1 | 1
I tried using the "apply" method, but it does not seem to be efficient for a DataFrame of shape [40000, 100] (with col1 containing 1k unique values).
Here is my code:
import pandas as pd
import scipy.sparse

df = pd.DataFrame({'col1': [['a', 'b', 'c'], ['b', 'd', 'e']], 'col2': ['a', 'g']})
# Sorted list of unique items, so the column order is deterministic
s = sorted({item for sublist in df['col1'] for item in sublist})
# One 0/1 indicator row per list in col1
res = scipy.sparse.csr_matrix(df['col1'].apply(
    lambda row: [1 if i in row else 0 for i in s]).values.tolist())
Then res.toarray() gives me
array([[1, 1, 1, 0, 0], [0, 1, 0, 1, 1]], dtype=int64)
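For comparison, here is a minimal sketch of one alternative that avoids building a dense Python list per row: flatten the lists once, map each item to a column code with np.unique, and pass the (row, column) pairs straight to scipy.sparse.csr_matrix. It assumes the items in col1 are strings that can be sorted; the names flat, lengths, row_codes, col_codes and labels are only illustrative, and it has not been benchmarked on the full 40000-row frame.

```python
from itertools import chain

import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'col1': [['a', 'b', 'c'], ['b', 'd', 'e']], 'col2': ['a', 'g']})

# Length of each list, used to repeat the row position once per item
lengths = df['col1'].map(len).to_numpy()

# All items in one flat list, in row order
flat = list(chain.from_iterable(df['col1']))

# labels: sorted unique items; col_codes: column position of each flat item
labels, col_codes = np.unique(np.array(flat), return_inverse=True)

# Row position of each flat item: 0, 0, 0, 1, 1, 1 for the example above
row_codes = np.repeat(np.arange(len(df)), lengths)

# Build the sparse indicator matrix directly from (row, column) pairs
data = np.ones(len(flat), dtype=np.int64)
res = sparse.csr_matrix((data, (row_codes, col_codes)),
                        shape=(len(df), len(labels)))

# Repeated items within one list are summed on construction; reset them to 1
res.data[:] = 1

print(labels.tolist())
print(res.toarray())
```

If the indicator columns are needed next to the original ones, pd.DataFrame.sparse.from_spmatrix(res, index=df.index, columns=labels) followed by df.join(...) should attach them while keeping the values sparse.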
Does anyone have a more efficient way of doing this?
In advance, thanks a lot!