
I have a dataset of event occurrences, each with a start time and a duration. I'm trying to get this into a boolean format (i.e. a True value when the event was occurring and a False value when it was not).

Example dataframe:

time  event_id  duration
   2         1         3
   3         2        15
   7         1         4
  14         3         6

and my desired output would be:

time  event_1  event_2  event_3
   0        0        0        0
   1        0        0        0
   2        1        0        0
   3        1        1        0
   4        1        1        0
   5        0        1        0
 ...      ...      ...      ...
  16        0        1        1
  17        0        1        1
  18        0        0        1
  19        0        0        1
  20        0        0        0

My solution at the moment is this:

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20  # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)

for col in df:
    not_null = df.loc[~df[col].isna(), col]
    for start, length in not_null.items():
        df.loc[start:start+int(length)-1, col] = 1

df = df.fillna(0).astype(int)

This works, but because I'm nesting a loop it doesn't feel very scalable, and I want to be able to log hundreds of events over spans of several hours. Is there a way I can do this with pandas functions to parallelise it more?

Apollo42

2 Answers


Use Index.repeat with DataFrame.loc to duplicate each row once per unit of its duration, add a per-row counter with GroupBy.cumcount, then reshape with crosstab. If there are duplicate values in the real data, compare with 0 and cast to integers; last, add the missing rows with DataFrame.reindex and prefix the column names with DataFrame.add_prefix:

Notice: if you need counts instead of a flag of 1, remove .ne(0).astype(int).

max_time = 20  # this is an external value but can assume no events happen after this
t = np.arange(max_time+1)

df = df.loc[df.index.repeat(df['duration'])]
df['time'] = df['time'].add(df.groupby(level=0).cumcount())

df = (pd.crosstab(df["time"],df["event_id"]).ne(0).astype(int)
        .reindex(t, fill_value=0)
        .add_prefix('event'))
print (df)
event_id  event1  event2  event3
time                            
0              0       0       0
1              0       0       0
2              1       0       0
3              1       1       0
4              1       1       0
5              0       1       0
6              0       1       0
7              1       1       0
8              1       1       0
9              1       1       0
10             1       1       0
11             0       1       0
12             0       1       0
13             0       1       0
14             0       1       1
15             0       1       1
16             0       1       1
17             0       1       1
18             0       0       1
19             0       0       1
20             0       0       0
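As a quick illustration of the count variant mentioned in the notice above, here is the same repeat/cumcount/crosstab pipeline without the .ne(0).astype(int) step, rebuilt from the question's sample data (a sketch, not part of the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20
t = np.arange(max_time + 1)

# repeat each row once per time step of its duration
df = df.loc[df.index.repeat(df["duration"])]
# offset each repeated row's time by a within-group counter (0, 1, 2, ...)
df["time"] = df["time"].add(df.groupby(level=0).cumcount())

# without .ne(0).astype(int) the table holds raw counts per (time, event_id)
counts = (pd.crosstab(df["time"], df["event_id"])
            .reindex(t, fill_value=0)
            .add_prefix("event"))
```

With this sample data no (time, event_id) pair repeats, so the counts happen to match the 0/1 flags; they would differ only if the same event were logged twice over overlapping spans.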

EDIT: Vectorized solution: compare a counter of consecutive missing values against the forward-filled durations:

#your solution
max_time = 20  # this is an external value but can assume no events happen after this
df = df.pivot(index="time", columns="event_id", values="duration")
t = np.arange(0, max_time+1, 1)
df = df.reindex(t)
print (df)
event_id    1     2    3
time                    
0         NaN   NaN  NaN
1         NaN   NaN  NaN
2         3.0   NaN  NaN
3         NaN  15.0  NaN
4         NaN   NaN  NaN
5         NaN   NaN  NaN
6         NaN   NaN  NaN
7         4.0   NaN  NaN
8         NaN   NaN  NaN
9         NaN   NaN  NaN
10        NaN   NaN  NaN
11        NaN   NaN  NaN
12        NaN   NaN  NaN
13        NaN   NaN  NaN
14        NaN   NaN  6.0
15        NaN   NaN  NaN
16        NaN   NaN  NaN
17        NaN   NaN  NaN
18        NaN   NaN  NaN
19        NaN   NaN  NaN
20        NaN   NaN  NaN

#https://stackoverflow.com/a/48659066/2901002
a = df.isna()
b = a.cumsum()
df1 = b.sub(b.mask(a).ffill()).lt(df.ffill()).astype(int).add_prefix('event')
print (df1)
event_id  event1  event2  event3
time                            
0              0       0       0
1              0       0       0
2              1       0       0
3              1       1       0
4              1       1       0
5              0       1       0
6              0       1       0
7              1       1       0
8              1       1       0
9              1       1       0
10             1       1       0
11             0       1       0
12             0       1       0
13             0       1       0
14             0       1       1
15             0       1       1
16             0       1       1
17             0       1       1
18             0       0       1
19             0       0       1
20             0       0       0
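To see why the masked-cumsum trick works, here is the same idea applied to a single column in isolation (a sketch for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# one column of the pivoted frame: duration 3 recorded at time 2,
# duration 4 recorded at time 7, NaN everywhere else
s = pd.Series([np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan,
               4.0, np.nan, np.nan, np.nan, np.nan])

a = s.isna()             # True where no event starts
b = a.cumsum()           # running count of NaNs seen so far
# b.mask(a) keeps the NaN-count only at event starts; forward filling it and
# subtracting gives, per row, how many steps have passed since the last start
counter = b.sub(b.mask(a).ffill())
# a row is "on" while that counter is still below the event's duration
on = counter.lt(s.ffill()).astype(int)
```

Rows before the first event compare NaN < NaN, which is False, so they correctly come out 0.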
jezrael
  • I appreciate the answer but unfortunately this runs significantly slower than my loop solution on my test data. It seems like crosstab has a lot of trouble as the durations get large – Apollo42 Mar 03 '23 at 10:11
  • @Apollo42 - Thank you for feedback. I create another solution, I hope faster. – jezrael Mar 03 '23 at 10:46
  • thank you very much, your new solution is much more performant and appears to scale far better than my loop as the number of events increase. I very much appreciate the time you have taken to answer my query. – Apollo42 Mar 03 '23 at 11:16

One approach could be:

df['rows'] = df[['time', 'duration']].apply(
    lambda row: list(range(row['time'], row['time'] + row['duration'])), axis=1)
df = (df.explode('rows')
        .pivot(index='rows', columns='event_id', values='time')
        .notna().astype('int'))

What this does is first create a new column, rows, which holds a list of the time steps during which each event is active; these values are later used as the index.

Then, after exploding this column (i.e. creating a new row for each item in these lists), pivot the DataFrame to use 'rows' as the index, 'event_id' as the columns, and any other non-empty column as the values. Empty cells after the pivot mean there was no such record (and so the event was not active at that time), so check for null values and convert the result from bool to int.

The result:

rows    1   2   3
2   1   0   0
3   1   1   0
4   1   1   0
5   0   1   0
6   0   1   0
7   1   1   0
8   1   1   0
9   1   1   0
10  1   1   0
11  0   1   0
12  0   1   0
13  0   1   0
14  0   1   1
15  0   1   1
16  0   1   1
17  0   1   1
18  0   0   1
19  0   0   1
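Note that because the pivot only covers times that occur in the exploded ranges, the result above is missing times 0, 1, and 20 from the desired output. A reindex over the full range fills those in (a sketch, reusing the question's max_time):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=["time", "event_id", "duration"],
                  data=[[2, 1, 3], [3, 2, 15], [7, 1, 4], [14, 3, 6]])
max_time = 20  # from the question: no events happen after this

df['rows'] = df[['time', 'duration']].apply(
    lambda row: list(range(row['time'], row['time'] + row['duration'])), axis=1)
out = (df.explode('rows')
         .astype({'rows': int})   # explode leaves an object dtype; cast for reindex
         .pivot(index='rows', columns='event_id', values='time')
         .notna().astype('int')
         .reindex(np.arange(max_time + 1), fill_value=0))
```

This brings the output to the same 0..20 shape as the accepted answer, with all-zero rows where no event was active.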
dm2
  • I appreciate the answer, however this solution runs slower than my original loop on my test data. I haven't profiled it but if I had to guess it would be down to the apply – Apollo42 Mar 03 '23 at 10:53