31

Problem

I want to calculate diff by group. And I don’t know how to sort the time column so that each group results are sorted and positive.

The original data :

In [37]: df 
Out[37]:
  id                time
0  A 2016-11-25 16:32:17
1  A 2016-11-25 16:36:04
2  A 2016-11-25 16:35:29
3  B 2016-11-25 16:35:24
4  B 2016-11-25 16:35:46

The result I want

Out[40]:
   id   time
0  A   00:35
1  A   03:12
2  B   00:22

notice: the type of time col is timedelta64[ns]

Trying

In [38]: df['time'].diff(1)
Out[38]:
0                 NaT
1            00:03:47
2   -1 days +23:59:25
3   -1 days +23:59:55
4            00:00:22
Name: time, dtype: timedelta64[ns]

Don't get desired result.

Hope

Not only solve the problem but the code can run fast because there are 50 million rows.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Jack
  • 1,724
  • 4
  • 18
  • 33
  • 2
    You need to sort the data first `df = df.sort_values(['id','time'])`, then you can do `df.groupby('id')['time'].diff()` – EdChum Nov 25 '16 at 11:45

1 Answers1

52

You can use sort_values with groupby and aggregating diff:

df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id                time     diff
0  A 2016-11-25 16:32:17      NaT
1  A 2016-11-25 16:36:04 00:00:35
2  A 2016-11-25 16:35:29 00:03:12
3  B 2016-11-25 16:35:24      NaT
4  B 2016-11-25 16:35:46 00:00:22

If need remove rows with NaT in column diff use dropna:

df = df.dropna(subset=['diff'])
print (df)
  id                time     diff
2  A 2016-11-25 16:35:29 00:03:12
1  A 2016-11-25 16:36:04 00:00:35
4  B 2016-11-25 16:35:46 00:00:22

You can also overwrite column:

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
  id     time
0  A      NaT
1  A 00:00:35
2  A 00:03:12
3  B      NaT
4  B 00:00:22

df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
  id     time
1  A 00:00:35
2  A 00:03:12
4  B 00:00:22
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks for this trick. but If i needed something in only seconds? your answer will return something like `0 days 02:20:25` – LeMarque Jun 01 '22 at 13:51
  • i used and it worked... thanks Jezrael ... `df.time = df.sort_values(['id','time']).groupby('id')['time'].diff().dt.total_seconds()` – LeMarque Jun 01 '22 at 13:55