0

I am looking for help with Groupby + Transform. I realize combined they are a very powerful tool, but am having trouble finding documentation/specific situations on what I want to accomplish. I would like to calculate the time delta based upon groups defined by multiple arguments. Specifically, I would like a column that displays the Assigned to Approved time difference by Casenumber AND Site AND only if there are both Assigned and Approval present. Please see below for my data set and code so far:

Dataset (start)

Casenumber Site  CreatedDate            NewValue
1          A     2018-03-30 16:47:03    Approved
2          A     2018-04-02 19:24:43    Assigned
2          A     2018-04-02 25:24:43    Delayed
2          A     2018-04-02 20:49:45    Approved
2          B     2018-04-02 19:24:43    Assigned
2          B     2018-04-02 22:49:45    Approved
2          B     2018-04-02 25:24:43    Delayed

Dataset (finish)

Casenumber Site  CreatedDate                NewValue   Diff
    1          A     2018-03-30 16:47:03    Approved   N/A
    2          A     2018-04-02 19:24:43    Assigned   0 days 0 hours
    2          A     2018-04-02 20:49:45    Approved   0 days 1:25:02
    2          A     2018-04-02 25:24:43    Delayed    N/A
    2          B     2018-04-02 19:24:43    Assigned   0 days 0 hours
    2          B     2018-04-02 22:49:45    Approved   0 days 3:25:02
    2          B     2018-04-02 25:24:43    Delayed    N/A

What I've got so far

(df['Diff'] = df['CreatedDate'] - 
(df.groupby(['Casenumber', 'Site'])['CreatedDate'].transform('first')))

Thank you for any help!

JJAN
  • 777
  • 1
  • 7
  • 13

2 Answers2

1

Make sure first you 'CreatedDate' column is datetime:

df['CreatedDate'] = pd.to_datetime(df['CreatedDate'])

Then this should work.

df['diff'] = df.sort_values(['Casenumber', 'Site']).groupby(['Casenumber', 'Site'])['CreatedDate'].diff()

Checkout this other post which could be helpful. How to calculate time difference by group using pandas?

Salma R
  • 194
  • 1
  • 12
  • Thank you for the comment and the referenced question is helpful. I updated my question to include additional detail. Is there any way to calculate the difference for only those groups that have both approved and assigned and only those two lines? – JJAN Apr 24 '18 at 21:38
  • i'm not sure if I understand your question right. But you could first have another df with only "approved" and "assigned" values: df2 = df.loc[df['New_value'].isin(['Approved', 'Assigned')] and then do the groupby. – Salma R Apr 25 '18 at 22:36
0

In the end, I used a pivot table format. Starting with a pivot of the data with a multi index.

restrdata = pd.pivot_table(restrdata, index =['Casenumber','OldValue','Site'], columns = 'NewValue', values = 'CreatedDate', aggfunc = 'first')

Then I began to create sub-pivot tables pulling any Casenumber that didn't have a null value for the response I was looking for.

AAdata = restrdata[(~restrdata['Approved: Admin Excep'].isnull()).groupby(['Casenumber']).transform('any')]

Finally, performing a calculation to get the difference between only the approved and assigned time.

AAdiff = (AAdata[~AAdata['Approved'].isnull()].reset_index(level=1)['Approved']
 - AAdata[~AAdata['Assigned'].isnull()].reset_index(level=1)['Assigned'])
JJAN
  • 777
  • 1
  • 7
  • 13