
I have an irregular sampled timeseries

                        event
Time                         
2013-01-01 01:40:53.072     n
2013-01-01 01:41:25.563     e
2013-01-01 01:51:23.293     e
2013-01-01 01:57:14.168     e
2013-01-01 01:58:07.273     e
2013-01-01 02:05:36.250     e
2013-01-01 02:35:08.501     e
2013-01-01 02:37:36.498     e
2013-01-01 03:22:15.091     e
2013-01-01 03:35:58.140     e
2013-01-01 03:39:47.682     e
2013-01-01 04:22:18.756     e
2013-01-01 04:33:08.892     e
2013-01-01 04:43:17.985     n
2013-01-01 04:49:49.281     e
2013-01-01 05:10:26.957     e
2013-01-01 05:17:15.411     e
2013-01-01 06:11:15.033     e
2013-01-01 06:46:36.406     e
2013-01-01 07:26:00.488     e

and I would like to calculate the cumulative elapsed time between each event n. There is a similar question (Pandas time series time between events), but I'm unable to adapt its solution to my problem because of the irregular time index. My attempt uses df1['diff']=df1.groupby('event_bool')['event_time'].diff(), which produces something like this:

                        event  event_bool            diff
Time                                                     
2013-01-01 01:40:53.072     n        True             NaT
2013-01-01 01:41:25.563     e       False             NaT
2013-01-01 01:51:23.293     e       False 00:09:57.730000
2013-01-01 01:57:14.168     e       False 00:05:50.875000
2013-01-01 01:58:07.273     e       False 00:00:53.105000
2013-01-01 02:05:36.250     e       False 00:07:28.977000
2013-01-01 02:35:08.501     e       False 00:29:32.251000
2013-01-01 02:37:36.498     e       False 00:02:27.997000
2013-01-01 03:22:15.091     e       False 00:44:38.593000
2013-01-01 03:35:58.140     e       False 00:13:43.049000
2013-01-01 03:39:47.682     e       False 00:03:49.542000
2013-01-01 04:22:18.756     e       False 00:42:31.074000
2013-01-01 04:33:08.892     e       False 00:10:50.136000
2013-01-01 04:43:17.985     n        True             NaT
2013-01-01 04:49:49.281     e       False 00:16:40.389000
2013-01-01 05:10:26.957     e       False 00:20:37.676000
2013-01-01 05:17:15.411     e       False 00:06:48.454000
2013-01-01 06:11:15.033     e       False 00:53:59.622000
2013-01-01 06:46:36.406     e       False 00:35:21.373000
2013-01-01 07:26:00.488     e       False 00:39:24.082000
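The attempt can be reproduced on a small frame (column names `event_time` and `event_bool` are taken from the snippet above, the rows are the first three of the data); it shows why the first `e` after an `n` comes out as NaT:

```python
import pandas as pd

# Small reproduction of the attempt
df1 = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2013-01-01 01:40:53.072",  # n
        "2013-01-01 01:41:25.563",  # e
        "2013-01-01 01:51:23.293",  # e
    ]),
    "event": ["n", "e", "e"],
})
df1["event_bool"] = df1["event"] == "n"

# diff() runs separately inside each group, so the first row of
# every group gets NaT -- the first 'e' never sees the 'n' before it
df1["diff"] = df1.groupby("event_bool")["event_time"].diff()
```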

However I have the following unresolved issues:

  1. There is a NaT for the first event e after n. The result should be `00:00:32.491000`
  2. How to cumulatively sum the elapsed time between n events?

link to data file

lorenzo

3 Answers


Let's try the following:

df = df.reset_index()

# Original attempt: diff only across the 'e' rows, so the first 'e'
# after each 'n' stays NaT (this matched the output in the question)
df_out = pd.concat([df,df.where(df['event'] == 'e').dropna()['Time'].diff().rename('diff')],axis=1)

# Updated: diff across all rows, then mask the 'n' rows to NaT, so the
# first 'e' after an 'n' keeps its gap from the 'n' itself
df_out = pd.concat([df,df['Time'].diff().rename('diff').mask(df['event'] == 'n')],axis=1)

# Cumulative sum of the gaps within each block delimited by an 'n' event
df_out['cum diff'] = df_out.groupby((df_out.event == 'n').cumsum())['diff'].transform(lambda x: x.fillna(0).cumsum())

df_out = df_out.set_index('Time')

Updated Output:

                         Time event            diff        cum diff
0  2013-01-01 01:40:53.072     n             NaT        00:00:00
1  2013-01-01 01:41:25.563     e 00:00:32.491000 00:00:32.491000
2  2013-01-01 01:51:23.293     e 00:09:57.730000 00:10:30.221000
3  2013-01-01 01:57:14.168     e 00:05:50.875000 00:16:21.096000
4  2013-01-01 01:58:07.273     e 00:00:53.105000 00:17:14.201000
5  2013-01-01 02:05:36.250     e 00:07:28.977000 00:24:43.178000
6  2013-01-01 02:35:08.501     e 00:29:32.251000 00:54:15.429000
7  2013-01-01 02:37:36.498     e 00:02:27.997000 00:56:43.426000
8  2013-01-01 03:22:15.091     e 00:44:38.593000 01:41:22.019000
9  2013-01-01 03:35:58.140     e 00:13:43.049000 01:55:05.068000
10 2013-01-01 03:39:47.682     e 00:03:49.542000 01:58:54.610000
11 2013-01-01 04:22:18.756     e 00:42:31.074000 02:41:25.684000
12 2013-01-01 04:33:08.892     e 00:10:50.136000 02:52:15.820000
13 2013-01-01 04:43:17.985     n             NaT        00:00:00
14 2013-01-01 04:49:49.281     e 00:06:31.296000 00:06:31.296000
15 2013-01-01 05:10:26.957     e 00:20:37.676000 00:27:08.972000
16 2013-01-01 05:17:15.411     e 00:06:48.454000 00:33:57.426000
17 2013-01-01 06:11:15.033     e 00:53:59.622000 01:27:57.048000
18 2013-01-01 06:46:36.406     e 00:35:21.373000 02:03:18.421000
19 2013-01-01 07:26:00.488     e 00:39:24.082000 02:42:42.503000
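The core of the answer is the grouping key: a cumulative sum over the boolean `'n'` marker labels every row with the index of the most recent `n` event, so each block between `n` events becomes its own group. A minimal sketch:

```python
import pandas as pd

# cumsum over the boolean 'n' marker increments at every 'n',
# giving a group id that "resets" the cumulative sum
events = pd.Series(["n", "e", "e", "n", "e"])
group_id = (events == "n").cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2]
```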
Scott Boston
  • Nice! However, there is a shift in the `diff` column that I'm not able to solve. The first `e` should have a diff of `00:00:32.491000`, not NaT. Same for the first `e` after the second `n`: it should display `2013-01-01 04:49:49.281 - 2013-01-01 04:43:17.985 = 00:06:31.296` – lorenzo Jun 13 '17 at 19:40
  • @lorenzo Oh... I was matching your output in the question. Does this look better? – Scott Boston Jun 13 '17 at 19:52
  • Works like a charm! Thanks @Scott – lorenzo Jun 13 '17 at 20:14

Not sure what a NaT is, but you could use a fill method to replace all null values in the diff column, then use a .sum() aggregation.
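A minimal sketch of that idea, assuming a frame with a timedelta `diff` column like the one in the question (the data here is made up):

```python
import pandas as pd

# Made-up frame: NaT marks the 'n' rows, as in the question
df = pd.DataFrame({
    "event": ["n", "e", "e", "n", "e"],
    "diff": pd.to_timedelta([None, "00:00:30", "00:01:00", None, "00:00:10"]),
})

# Fill NaT (pandas' missing marker for datetimes/timedeltas) with a
# zero timedelta, then aggregate with .sum()
total = df["diff"].fillna(pd.Timedelta(0)).sum()
print(total)  # 0 days 00:01:40
```

Note this gives a single grand total; it does not by itself restart the sum at each `n` event.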

Caleb

First I thought about a solution using a loop, which goes as follows:

times = []
for index, row in df.iterrows():
    if row['event'] == 'n':
        last = row['Time']
    times.append(row['Time'] - last)

df['TimeNew'] = times

But then, I saw the other answer, and I ran some tests to see which performs better. I ran each method 10 times and took the average time:

Lines | Loop method (s) | lambda method (s) |
---------------------------------------------
21    | 0.006838305     | 0.013882545       |
504   | 0.092648337     | 0.056006076       |
1000  | 0.169315854     | 0.097687499       |
10000 | 1.414376600     | 0.746927508       |

Execution time by method

The accepted answer is indeed faster for larger data, which is not so surprising given how loops normally perform compared to vectorized operations.
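The benchmark harness isn't shown above; a sketch of how such a comparison might be run with `timeit` (synthetic data and function names are my own, not from the answers):

```python
import timeit

import numpy as np
import pandas as pd

def make_df(n):
    # Synthetic stand-in for the question's data: an 'n' every 10 rows
    return pd.DataFrame({
        "Time": pd.date_range("2013-01-01", periods=n, freq="1min"),
        "event": np.where(np.arange(n) % 10 == 0, "n", "e"),
    })

def loop_method(df):
    # Loop: remember the last 'n' time, subtract it from each row
    times = []
    last = df["Time"].iloc[0]
    for _, row in df.iterrows():
        if row["event"] == "n":
            last = row["Time"]
        times.append(row["Time"] - last)
    return times

def vectorized_method(df):
    # Vectorized: mask 'n' rows, cumsum the gaps within each block
    diff = df["Time"].diff().mask(df["event"] == "n")
    grp = (df["event"] == "n").cumsum()
    return diff.groupby(grp).transform(lambda x: x.fillna(pd.Timedelta(0)).cumsum())

df = make_df(1000)
t_loop = timeit.timeit(lambda: loop_method(df), number=10) / 10
t_vec = timeit.timeit(lambda: vectorized_method(df), number=10) / 10
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```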