I have a dataset of event occurrences, each with a start time and a duration. I'm trying to convert this into a boolean format (i.e. a True value while the event was occurring and False when it was not).
Example dataframe:
time | event_id | duration |
---|---|---|
2 | 1 | 3 |
3 | 2 | 15 |
7 | 1 | 4 |
14 | 3 | 6 |
and my desired output would be:
time | event_1 | event_2 | event_3 |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 1 | 0 |
4 | 1 | 1 | 0 |
5 | 0 | 1 | 0 |
... | ... | ... | ... |
16 | 0 | 1 | 1 |
17 | 0 | 1 | 1 |
18 | 0 | 0 | 1 |
19 | 0 | 0 | 1 |
20 | 0 | 0 | 0 |
My solution at the moment is this:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20  # external value, but we can assume no events happen after this

# One column per event_id, indexed by start time
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time + 1)
df = df.reindex(t)

for col in df:
    not_null = df.loc[~df[col].isna(), col]
    for start, length in not_null.items():
        # mark the event as active for its full duration
        df.loc[start:start + int(length) - 1, col] = 1

df = df.fillna(0).astype(int)
```
This works, but the nested loop doesn't feel very scalable, and I want to be able to log hundreds of events over spans of several hours. Is there a way I can do this with vectorised pandas/NumPy operations instead?
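For reference, one broadcasting-based sketch I've been experimenting with (a comparison of each time point against each event's `[start, start + duration)` interval, then collapsing rows that share an `event_id`). This is just an idea, not benchmarked at scale:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": [2, 3, 7, 14],
                   "event_id": [1, 2, 1, 3],
                   "duration": [3, 15, 4, 6]})
max_time = 20

t = np.arange(max_time + 1)
start = df["time"].to_numpy()
end = start + df["duration"].to_numpy()

# Broadcast: (T, 1) against (n_rows,) -> boolean matrix of shape (T, n_rows),
# True where time point t falls inside that row's [start, end) interval
active = (t[:, None] >= start) & (t[:, None] < end)

# Collapse rows sharing an event_id into one column via any()
out = (pd.DataFrame(active, index=t)
         .T
         .groupby(df["event_id"])
         .any()
         .T
         .astype(int))
out.columns = [f"event_{c}" for c in out.columns]
```

The `(T, n_rows)` intermediate matrix is the obvious memory cost here, so it may not be the right trade-off for very long time spans.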