How to calculate day count from dates in different rows (Time in state calculation)

Question

Say I have a dataframe that looks like this:

Event           Transition Date        From             To  
ESS-123     2018-12-28 10:27:35.913  Planning       In Progress
ESS-123     2019-04-20 10:51:07.857  In Progress    Deferred
HTO-254     2019-04-04 15:19:06.013  Planning       In Progress
HTO-254     2019-05-05 10:35:03.083  In Progress    Not in work
HTO-254     2019-05-24 10:55:12.280  Work Planned   In Progress
AEW-8465    2018-10-01 09:40:34.070  Work Planned   In Progress
AEW-8465    2018-10-01 10:30:07.993  In Progress    Suspended
AEW-8465    2018-10-25 12:22:13.203  Drafted        In Progress
AEW-8465    2019-02-05 15:58:13.900  In Progress    Suspended

I want to be able to calculate the time in state of 'In Progress' for each Event. So I have to be able to filter by Event, aggregate a time in state, and add it to a new column. I can't even wrap my head around where to start. Any help by anyone would be greatly appreciated.

Take a look at ```groupby``` and ```shift```. Example in this question https://stackoverflow.com/questions/53335567/use-pandas-shift-within-a-group. — Eric Truett, Apr 29 '20 at 16:11
That's exactly what I needed along with a mask to make this work. The answer below used the same method. Thank you for your help! — ZebraDonkey, Apr 29 '20 at 17:31

It_is_Chris · Accepted Answer · 2020-04-29T16:54:30.710

Just calculate the difference for every row and then filter if you want...this assumes that your frame is sorted appropriately

# create groupby object
g = df.groupby('Event')
# calculate the difference for each group
df['diff'] = g['Transition Date'].diff(periods=1)
# create a mask to filter based on your conditions
mask = ((g['To'].shift(0) == 'In Progress') & (g['From'].shift(-1) == 'In Progress')) |\
       ((g['From'].shift(0) == 'In Progress') & (g['To'].shift(1) == 'In Progress'))
print(df[mask])

      Event         Transition Date          From           To  \
0   ESS-123 2018-12-28 10:27:35.913      Planning  In Progress   
1   ESS-123 2019-04-20 10:51:07.857   In Progress     Deferred   
2   HTO-254 2019-04-04 15:19:06.013      Planning  In Progress   
3   HTO-254 2019-05-05 10:35:03.083   In Progress  Not in work   
5  AEW-8465 2018-10-01 09:40:34.070  Work Planned  In Progress   
6  AEW-8465 2018-10-01 10:30:07.993   In Progress    Suspended   
7  AEW-8465 2018-10-25 12:22:13.203       Drafted  In Progress   
8  AEW-8465 2019-02-05 15:58:13.900   In Progress    Suspended   

                      diff  
0                      NaT  
1 113 days 00:23:31.944000  
2                      NaT  
3  30 days 19:15:57.070000  
5                      NaT  
6   0 days 00:49:33.923000  
7  24 days 01:52:05.210000  
8 103 days 03:36:00.697000

or if you want the total time each event spent in progress

df[mask].groupby('Event')['diff'].sum()

Event
AEW-8465   127 days 06:17:39.830000
ESS-123    113 days 00:23:31.944000
HTO-254     30 days 19:15:57.070000
Name: diff, dtype: timedelta64[ns]

This is exactly what I needed! Thank you so much! I didn't even think about using a mask. — ZebraDonkey, Apr 29 '20 at 17:29

How to calculate day count from dates in different rows (Time in state calculation)

1 Answers1