1

I have two dataframes (df1 and df2). df1 has all user IDs, purchase dates, and purchase amounts:

index     user     date_1        value
0         a        2021-02-05    50.0
1         b        2021-02-20    12.0
2         a        2021-03-06    45.0
3         c        2021-03-05    30.0
4         c        2021-03-15    44.0
...       ...      ...           ...

df2 has user IDs and the date they became a user:

index     user     date_2      
0         a        2021-02-05
1         b        2021-02-20
2         c        2021-03-06
3         d        2021-03-20
...       ...      ...

Goal: I am looking to add a new column to df2 that sums the value of each user's first month of transactions.

I found the following thread:

Pandas group by then count sum...

This keeps everything in a single dataframe, however, and I am getting lost trying to mix the apply, lambda, groupby, and sum functions. I have gotten to the following setup, but it returns incorrect values for the new column:

def f(x, y, t):
return x.apply(lambda y: x.loc[x['date_1'].between(y['date_2'],
                                                   y['date_2'] + t,
                                                    inclusive=False),'value'].sum(),axis=1)

df2['monthly_volume'] = df1.groupby('user', group_keys=False).
                                         apply(f, df2,
                                         pd.Timedelta(30, unit='D'))

Eventually, as I get more data, I would also look to find the purchase volume of users in their second month, third month, etc. after signing up. Maybe there is a better of going about this to make it repeatable in addition to fixing it. Appreciate the help!

Colin
  • 133
  • 1
  • 6

1 Answers1

1

You can do boolean indexing. For example:

def get_sum(user, start_date, end_date):
    return df1.loc[
        (df1.user == user) & (df1.date_1.between(start_date, end_date)),
        "value",
    ].sum()


df2["monthly_volume"] = df2.apply(
    lambda x: get_sum(
        x["user"], x["date_2"], x["date_2"] + pd.Timedelta(days=30)
    ),
    axis=1,
)
print(df2)

Prints:

  user     date_2  monthly_volume
0    a 2021-02-05            95.0
1    b 2021-02-20            12.0
2    c 2021-03-06            44.0
3    d 2021-03-20             0.0
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91