2

I'm trying to compute the difference between rows of a column based upon the first and last date, by group. for example:

df = pd.DataFrame({'ID':["a","a","a","b","b","b"],
                   'start_yr':[2010,2013,2020,2009,2005,2019],
                   'amt':[10,40,30,50,60,100]})

should return

a 20   (30-10)
b 40   (100 - 60)

I've tried stuff like this, but i'm missing something for sure.

#df['diff'] = df.groupby('ID')['start_yr','amt'].transform(lambda x: (x.max()-x.min()))
df['diff'] = dThe f.groupby('ID')['start_yr'].transform(lambda x: (x.max()-x.min()))
nerdlyfe
  • 487
  • 7
  • 21

2 Answers2

4

First sorting both columns by DataFrame.sort_values and then subtract last with first value in GroupBy.agg:

df1 = (df.sort_values(['ID','start_yr'])
         .groupby('ID')['amt']
         .agg(lambda x: (x.iat[-1]-x.iat[0]))
         .reset_index())

Or subtract values with aggregate GroupBy.first and GroupBy.last:

df = df.sort_values(['ID','start_yr'])
g = df.groupby('ID')['amt']

df1 = g.last().sub(g.first()).reset_index()

Or for first and last values is possible use DataFrame.drop_duplicates:

df = df.sort_values(['ID','start_yr'])

df1 = (df.drop_duplicates('ID', keep='last').set_index('ID')['amt']
         .sub(df.drop_duplicates('ID').set_index('ID')['amt'])
         .reset_index())

print (df1)
  ID  amt
0  a   20
1  b   40

EDIT: Here is idea without sorting first by DataFrameGroupBy.idxmax DataFrameGroupBy.idxmin for values by index (here amt) by minimal and maximal of start_yr:

g = df.set_index('amt').groupby('ID')['start_yr']

s = g.idxmax().sub(g.idxmin()).reset_index()
print (s)
  ID  start_yr
0  a        20
1  b        40
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • this is perfect thx. so there is no way to use the date explicitly to refer to the other column. sorting first is still efficient? notably, i'm using. an actual date field. not an int as a date. i should have added that. – nerdlyfe Dec 15 '20 at 06:48
  • 1
    @nerdlyfe - If dates are datetimes, sorting working perfectly. – jezrael Dec 15 '20 at 06:48
  • @nerdlyfe - Interesting question about performance. I think depends of data, also you get me idea for soluton without sorting, added to answer. Thank you. – jezrael Dec 15 '20 at 06:57
2

Let's try two steps:

s = df.sort_values(['ID','start_yr']).groupby(['ID'])['amt'].agg(['first','last'])
output = s['last'] - s['first']

Output:

ID
a    20
b    40
dtype: int64
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74