1

Here are the two series. They were generated after grouping a dataframe based on two column values, patient_id and event_description:

Series 1:

patient_id                                event_description     
A                                             DiagnosisA          2016-09-15
B                                             DiagnosisA          2013-01-01
C                                             DiagnosisA         2016-01-10
D                                             DiagnosisA          2015-01-02

Series 2:

patient_id                                 event_description     
A                                             DiagnosisB          2016-09-30
B                                             DiagnosisB          2013-01-04
D                                             DiagnosisB          2015-01-15

So I essentially Would like to find the difference in time between between each patient. For example for patient A it would return 15 days but for patient C, It could say `NaN.

madsthaks
  • 2,091
  • 6
  • 25
  • 46

3 Answers3

3

If the event_description column has no role to play here, I'd suggest just getting rid of it. Make a reset_index call and just subtract the resultants.

r1 = df1.reset_index(level=1, drop=1)
r2 = df2.reset_index(level=1, drop=1)

r = r2.sub(r1)

r
           time
patient_id        
A          15 days
B           3 days
C              NaT
D          13 days
cs95
  • 379,657
  • 97
  • 704
  • 746
3

Assuming your groupby included both DiagnosisB and DiagnosisA, you could unstack beforehand and subtract

# This should be what you have after groupby
s = pd.concat([s1, s2])

s

patient_id  event_description
A           DiagnosisA          2016-09-15
B           DiagnosisA          2013-01-01
C           DiagnosisA          2016-01-10
D           DiagnosisA          2015-01-02
A           DiagnosisB          2016-09-30
B           DiagnosisB          2013-01-04
D           DiagnosisB          2015-01-15
Name: Date, dtype: datetime64[ns]

Then you can unstack and subtract

s.unstack().pipe(lambda d: d.DiagnosisB - d.DiagnosisA)

patient_id
A   15 days
B    3 days
C       NaT
D   13 days
dtype: timedelta64[ns]

Not Recommended
But fun with unstack and squeeze

s1.unstack().squeeze().rsub(s2.unstack().squeeze())

patient_id
A   15 days
B    3 days
C       NaT
D   13 days
dtype: timedelta64[ns]

Also Not Recommended
But fun with xs

s1.xs('DiagnosisA', level=1).rsub(s2.xs('DiagnosisB', level=1))

patient_id
A   15 days
B    3 days
C       NaT
D   13 days
Name: Date, dtype: timedelta64[ns]

Final Non-Recommendation
Had to show pd.IndexSlice
Note: Indices need to be fully lexsorted for this to work.

s1.loc[pd.IndexSlice[:, 'DiagnosisA']].rsub(s2.loc[pd.IndexSlice[:, 'DiagnosisB']])

patient_id
A   15 days
B    3 days
C       NaT
D   13 days
Name: Date, dtype: timedelta64[ns]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

You can use reset_index, we can also use droplevel i.e

df2.index = df2.index.droplevel(1)
df1.index = df1.index.droplevel(1)

df2 - df1

Output :

           time
patient_id        
A          15 days
B           3 days
C              NaT
D          13 days
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108