
I have an irregular sampled timeseries

                        event
Time                         
2013-01-01 01:40:53.072     n
2013-01-01 01:41:25.563     e
2013-01-01 01:51:23.293     e
2013-01-01 01:57:14.168     e
2013-01-01 01:58:07.273     e
2013-01-01 02:05:36.250     e
2013-01-01 02:35:08.501     e
2013-01-01 02:37:36.498     e
2013-01-01 03:22:15.091     e
2013-01-01 03:35:58.140     e
2013-01-01 03:39:47.682     e
2013-01-01 04:22:18.756     e
2013-01-01 04:33:08.892     e
2013-01-01 04:43:17.985     n
2013-01-01 04:49:49.281     e
2013-01-01 05:10:26.957     e
2013-01-01 05:17:15.411     e
2013-01-01 06:11:15.033     e
2013-01-01 06:46:36.406     e
2013-01-01 07:26:00.488     e

and I would like to calculate the cumulative elapsed time between each event n. There is a similar question (Pandas time series time between events), but I'm unable to adapt its solution to my problem because of the irregular time index. My attempt uses df1['diff']=df1.groupby('event_bool')['event_time'].diff(), which produces something like this:

                        event  event_bool            diff
Time                                                     
2013-01-01 01:40:53.072     n        True             NaT
2013-01-01 01:41:25.563     e       False             NaT
2013-01-01 01:51:23.293     e       False 00:09:57.730000
2013-01-01 01:57:14.168     e       False 00:05:50.875000
2013-01-01 01:58:07.273     e       False 00:00:53.105000
2013-01-01 02:05:36.250     e       False 00:07:28.977000
2013-01-01 02:35:08.501     e       False 00:29:32.251000
2013-01-01 02:37:36.498     e       False 00:02:27.997000
2013-01-01 03:22:15.091     e       False 00:44:38.593000
2013-01-01 03:35:58.140     e       False 00:13:43.049000
2013-01-01 03:39:47.682     e       False 00:03:49.542000
2013-01-01 04:22:18.756     e       False 00:42:31.074000
2013-01-01 04:33:08.892     e       False 00:10:50.136000
2013-01-01 04:43:17.985     n        True             NaT
2013-01-01 04:49:49.281     e       False 00:16:40.389000
2013-01-01 05:10:26.957     e       False 00:20:37.676000
2013-01-01 05:17:15.411     e       False 00:06:48.454000
2013-01-01 06:11:15.033     e       False 00:53:59.622000
2013-01-01 06:46:36.406     e       False 00:35:21.373000
2013-01-01 07:26:00.488     e       False 00:39:24.082000
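The attempt can be reproduced on a small frame (column names `event_time` and `event_bool` are taken from the snippet above, the rows are the first three of the data); it shows why the first `e` after an `n` comes out as NaT:

```python
import pandas as pd

# Small reproduction of the attempt
df1 = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2013-01-01 01:40:53.072",  # n
        "2013-01-01 01:41:25.563",  # e
        "2013-01-01 01:51:23.293",  # e
    ]),
    "event": ["n", "e", "e"],
})
df1["event_bool"] = df1["event"] == "n"

# diff() runs separately inside each group, so the first row of
# every group gets NaT -- the first 'e' never sees the 'n' before it
df1["diff"] = df1.groupby("event_bool")["event_time"].diff()
```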

However I have the following unresolved issues:

  1. There is a NaT for the first event e after n. The result should be `00:00:32.491000`
  2. How to cumulatively sum the elapsed time between n events?

link to data file

lorenzo

3 Answers


Let's try the following:

df = df.reset_index()

# Original attempt: diff only across the 'e' rows, so the first 'e'
# after each 'n' stays NaT (this matched the output in the question)
df_out = pd.concat([df,df.where(df['event'] == 'e').dropna()['Time'].diff().rename('diff')],axis=1)

# Updated: diff across all rows, then mask the 'n' rows to NaT, so the
# first 'e' after an 'n' keeps its gap from the 'n' itself
df_out = pd.concat([df,df['Time'].diff().rename('diff').mask(df['event'] == 'n')],axis=1)

# Cumulative sum of the gaps within each block delimited by an 'n' event
df_out['cum diff'] = df_out.groupby((df_out.event == 'n').cumsum())['diff'].transform(lambda x: x.fillna(0).cumsum())

df_out = df_out.set_index('Time')

Updated Output:

                         Time event            diff        cum diff
0  2013-01-01 01:40:53.072     n             NaT        00:00:00
1  2013-01-01 01:41:25.563     e 00:00:32.491000 00:00:32.491000
2  2013-01-01 01:51:23.293     e 00:09:57.730000 00:10:30.221000
3  2013-01-01 01:57:14.168     e 00:05:50.875000 00:16:21.096000
4  2013-01-01 01:58:07.273     e 00:00:53.105000 00:17:14.201000
5  2013-01-01 02:05:36.250     e 00:07:28.977000 00:24:43.178000
6  2013-01-01 02:35:08.501     e 00:29:32.251000 00:54:15.429000
7  2013-01-01 02:37:36.498     e 00:02:27.997000 00:56:43.426000
8  2013-01-01 03:22:15.091     e 00:44:38.593000 01:41:22.019000
9  2013-01-01 03:35:58.140     e 00:13:43.049000 01:55:05.068000
10 2013-01-01 03:39:47.682     e 00:03:49.542000 01:58:54.610000
11 2013-01-01 04:22:18.756     e 00:42:31.074000 02:41:25.684000
12 2013-01-01 04:33:08.892     e 00:10:50.136000 02:52:15.820000
13 2013-01-01 04:43:17.985     n             NaT        00:00:00
14 2013-01-01 04:49:49.281     e 00:06:31.296000 00:06:31.296000
15 2013-01-01 05:10:26.957     e 00:20:37.676000 00:27:08.972000
16 2013-01-01 05:17:15.411     e 00:06:48.454000 00:33:57.426000
17 2013-01-01 06:11:15.033     e 00:53:59.622000 01:27:57.048000
18 2013-01-01 06:46:36.406     e 00:35:21.373000 02:03:18.421000
19 2013-01-01 07:26:00.488     e 00:39:24.082000 02:42:42.503000
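The core of the answer is the grouping key: a cumulative sum over the boolean `'n'` marker labels every row with the index of the most recent `n` event, so each block between `n` events becomes its own group. A minimal sketch:

```python
import pandas as pd

# cumsum over the boolean 'n' marker increments at every 'n',
# giving a group id that "resets" the cumulative sum
events = pd.Series(["n", "e", "e", "n", "e"])
group_id = (events == "n").cumsum()
print(group_id.tolist())  # [1, 1, 1, 2, 2]
```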
Scott Boston
  • Nice! However, there is a shift in the `diff` column that I'm not able to solve. The first `e` should have a diff of `00:00:32.491000`, not NaT. Same for the first `e` after the second `n`: it should display `2013-01-01 04:49:49.281 - 2013-01-01 04:43:17.985 = 00:06:31.296` – lorenzo Jun 13 '17 at 19:40
  • @lorenzo Oh... I was matching your output in the question. Does this look better? – Scott Boston Jun 13 '17 at 19:52
  • Works like a charm! Thanks @Scott – lorenzo Jun 13 '17 at 20:14

Not sure what a NaT is, but you could use a fill method to replace all null values in the diff column, then use a .sum() aggregation.
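A minimal sketch of that idea, assuming a frame with a timedelta `diff` column like the one in the question (the data here is made up):

```python
import pandas as pd

# Made-up frame: NaT marks the 'n' rows, as in the question
df = pd.DataFrame({
    "event": ["n", "e", "e", "n", "e"],
    "diff": pd.to_timedelta([None, "00:00:30", "00:01:00", None, "00:00:10"]),
})

# Fill NaT (pandas' missing marker for datetimes/timedeltas) with a
# zero timedelta, then aggregate with .sum()
total = df["diff"].fillna(pd.Timedelta(0)).sum()
print(total)  # 0 days 00:01:40
```

Note this gives a single grand total; it does not by itself restart the sum at each `n` event.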

Caleb

First I thought about a solution using a loop, which goes as follows:

times = []
for index, row in df.iterrows():
    if row['event'] == 'n':
        last = row['Time']
    times.append(row['Time'] - last)

df['TimeNew'] = times

But then, I saw the other answer, and I ran some tests to see which performs better. I ran each method 10 times and took the average time:

Lines | Loop method (s) | lambda method (s) |
---------------------------------------------
21    | 0.006838305     | 0.013882545       |
504   | 0.092648337     | 0.056006076       |
1000  | 0.169315854     | 0.097687499       |
10000 | 1.414376600     | 0.746927508       |

Execution time by method

The accepted answer is indeed faster for larger data, which is not so surprising given how loops normally perform compared to vectorized operations.
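The benchmark harness isn't shown above; a sketch of how such a comparison might be run with `timeit` (synthetic data and function names are my own, not from the answers):

```python
import timeit

import numpy as np
import pandas as pd

def make_df(n):
    # Synthetic stand-in for the question's data: an 'n' every 10 rows
    return pd.DataFrame({
        "Time": pd.date_range("2013-01-01", periods=n, freq="1min"),
        "event": np.where(np.arange(n) % 10 == 0, "n", "e"),
    })

def loop_method(df):
    # Loop: remember the last 'n' time, subtract it from each row
    times = []
    last = df["Time"].iloc[0]
    for _, row in df.iterrows():
        if row["event"] == "n":
            last = row["Time"]
        times.append(row["Time"] - last)
    return times

def vectorized_method(df):
    # Vectorized: mask 'n' rows, cumsum the gaps within each block
    diff = df["Time"].diff().mask(df["event"] == "n")
    grp = (df["event"] == "n").cumsum()
    return diff.groupby(grp).transform(lambda x: x.fillna(pd.Timedelta(0)).cumsum())

df = make_df(1000)
t_loop = timeit.timeit(lambda: loop_method(df), number=10) / 10
t_vec = timeit.timeit(lambda: vectorized_method(df), number=10) / 10
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```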