how to use pandas to subtract rows of a column based upon data by group?

Question

I'm trying to compute the difference between rows of a column based upon the first and last date, by group. for example:

df = pd.DataFrame({'ID':["a","a","a","b","b","b"],
                   'start_yr':[2010,2013,2020,2009,2005,2019],
                   'amt':[10,40,30,50,60,100]})

should return

a 20   (30-10)
b 40   (100 - 60)

I've tried stuff like this, but i'm missing something for sure.

#df['diff'] = df.groupby('ID')['start_yr','amt'].transform(lambda x: (x.max()-x.min()))
df['diff'] = dThe f.groupby('ID')['start_yr'].transform(lambda x: (x.max()-x.min()))

jezrael · Answer 1 · 2020-12-15T06:56:13.683

First sorting both columns by DataFrame.sort_values and then subtract last with first value in GroupBy.agg:

df1 = (df.sort_values(['ID','start_yr'])
         .groupby('ID')['amt']
         .agg(lambda x: (x.iat[-1]-x.iat[0]))
         .reset_index())

Or subtract values with aggregate GroupBy.first and GroupBy.last:

df = df.sort_values(['ID','start_yr'])
g = df.groupby('ID')['amt']

df1 = g.last().sub(g.first()).reset_index()

Or for first and last values is possible use DataFrame.drop_duplicates:

df = df.sort_values(['ID','start_yr'])

df1 = (df.drop_duplicates('ID', keep='last').set_index('ID')['amt']
         .sub(df.drop_duplicates('ID').set_index('ID')['amt'])
         .reset_index())

print (df1)
  ID  amt
0  a   20
1  b   40

EDIT: Here is idea without sorting first by DataFrameGroupBy.idxmax DataFrameGroupBy.idxmin for values by index (here amt) by minimal and maximal of start_yr:

g = df.set_index('amt').groupby('ID')['start_yr']

s = g.idxmax().sub(g.idxmin()).reset_index()
print (s)
  ID  start_yr
0  a        20
1  b        40

this is perfect thx. so there is no way to use the date explicitly to refer to the other column. sorting first is still efficient? notably, i'm using. an actual date field. not an int as a date. i should have added that. — nerdlyfe, Dec 15 '20 at 06:48
@nerdlyfe - If dates are datetimes, sorting working perfectly. — jezrael, Dec 15 '20 at 06:48
@nerdlyfe - Interesting question about performance. I think depends of data, also you get me idea for soluton without sorting, added to answer. Thank you. — jezrael, Dec 15 '20 at 06:57

score 2 · Accepted Answer · answered Dec 15 '20 at 06:42

2

Let's try two steps:

s = df.sort_values(['ID','start_yr']).groupby(['ID'])['amt'].agg(['first','last'])
output = s['last'] - s['first']

Output:

ID
a    20
b    40
dtype: int64

answered Dec 15 '20 at 06:42

Quang Hoang

146,074
10
56
74

how to use pandas to subtract rows of a column based upon data by group?

2 Answers2