3

I have a pandas dataframe with 2 columns "type" and "sign" as follows

    type    sign
0   open    A
1   open    B
2   open    D
3   close   B
4   close   D
5   open    B
6   close   B
7   close   A

"A" + "open" means that event A has started happening. "A" + "close" means that event A has stopped. I need to vectorize it, when time is a major issue (the list is actually about 40 million lines long). Kind of like one-hot-encoding, but i need "1" to exist if and only if an event is "active". for example, for this case the results should be:

    A   B   C   D   type    sign
0   1   0   0   0   open    A
1   1   1   0   0   open    B
2   1   1   0   1   open    D
3   1   0   0   1   close   B
4   1   0   0   0   close   D
5   1   1   0   0   open    B
6   1   0   0   0   close   B
7   0   0   0   0   close   A

Ideas? thanks

Guy Barash
  • 470
  • 5
  • 17

2 Answers2

5

IIUC let do get_dummies then do cumsum

s=df.sign.str.get_dummies().reindex(columns=list('ABCD'),fill_value=0).\
    mul(df.type.map({'open':1,'close':-1}),axis=0).cumsum()
   A  B  C  D
0  1  0  0  0
1  1  1  0  0
2  1  1  0  1
3  1  0  0  1
4  1  0  0  0
5  1  1  0  0
6  1  0  0  0
7  0  0  0  0
df=df.join(s)
BENY
  • 317,841
  • 20
  • 164
  • 234
2

For your data, you can just do a for loop:

for sign in 'ABCD':
    df[sign] = (df['type'].where(df['sign'].eq(sign))
                 .ffill()
                 .eq('open')
                 .astype(int)
               )   

Output:

    type sign  A  B  C  D
0   open    A  1  0  0  0
1   open    B  1  1  0  0
2   open    D  1  1  0  1
3  close    B  1  0  0  1
4  close    D  1  0  0  0
5   open    B  1  1  0  0
6  close    B  1  0  0  0
7  close    A  0  0  0  0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • 22 seconds for 10 millions lines. good timing! can you think of a way to increase speed? i tried multiprocess on it, but just transfering the lines to other process cause it to become unusefull – Guy Barash Feb 18 '20 at 08:55