Finding the average of values in a column and create a new dataframe that distributes the average

Question

I want to replace existing values in a column with the average values of the same column using python, preferably. I want to distribute the payments equally to all the months from the first month of payment until the last month. The average monthly payments should be distributed per cust_id and sub_id.

Payments may skip months and are not the same.

I hope you could help me on this as I am only beginning to learn python.

The data looks like this:

cust_id	sub_id	date	payment
1	A	12/1/20	200
1	A	2/2/21	200
1	A	2/3/21	100
1	A	5/1/21	200
1	B	1/2/21	50
1	B	1/9/21	20
1	B	3/1/21	80
1	B	4/23/21	90
2	C	1/4/21	200
2	C	1/9/21	300

The result I want is this:

cust_id	sub_id	date	payment
1	A	12/1/20	116.67
1	A	1/1/21	116.67
1	A	2/1/21	116.67
1	A	3/1/21	116.67
1	A	4/1/21	116.67
1	A	5/1/21	116.67
1	B	1/1/21	60
1	B	2/1/21	60
1	B	3/1/21	60
1	B	4/1/21	60
2	C	1/1/21	500

Thank you very much!

I need to distribute the payments to all the months including the months with no payment as long as they are between the first month of payment and the last month of payment. — Yam, Apr 01 '21 at 17:05
What have you tried? And where are you stuck? Please refer to this page on guidelines for asking questions: https://stackoverflow.com/help/how-to-ask — AlexK, Apr 01 '21 at 21:56
For cust_id=2 and sub_id='C', it appears you only have one month in the original table (Jan 2021). But in the output table, you create multiple months for that grouping. Please check. — AlexK, Apr 01 '21 at 22:06
Thanks! It was a typo on my part (cust_id=2 and sub_id='C'). — Yam, Apr 02 '21 at 13:39

score 1 · Answer 1 · answered Apr 01 '21 at 23:04

As noted in the comments your answer for cust_id=2 and sub_id='C' appears to be inconsistent with your requirements, so I go by the latter.

First, we aggregate dates into a min,max and payments into a sum:

df2 = df.groupby(['cust_id','sub_id']).agg({'date':[min,max], 'payment':sum})
df2.columns = df2.columns.get_level_values(1)
df2

and we get

        min         max         sum
cust_id sub_id          
1   A   2020-12-01  2021-05-01  700
    B   2021-01-02  2021-04-23  240
2   C   2021-01-04  2021-01-09  500

Then we create a monthly schedule for each row from min to max. Here you may have to fiddle with the dates a bit to have them nicely lined up, I just did the basics to show the idea:

from datetime import timedelta
df2['schedule'] = df2.apply(lambda row: pd.date_range(row['min'],row['max'] + timedelta(days = 31), freq = '1M'),axis=1)

Now df2 looks like this:


          min                  max                    sum  schedule
--------  -------------------  -------------------  -----  ---------------------------------------------------------------------------------------------------------
(1, 'A')  2020-12-01 00:00:00  2021-05-01 00:00:00    700  DatetimeIndex(['2020-12-31', '2021-01-31', '2021-02-28', '2021-03-31',
                                                                          '2021-04-30', '2021-05-31'],
                                                                         dtype='datetime64[ns]', freq='M')
(1, 'B')  2021-01-02 00:00:00  2021-04-23 00:00:00    240  DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'], dtype='datetime64[ns]', freq='M')
(2, 'C')  2021-01-04 00:00:00  2021-01-09 00:00:00    500  DatetimeIndex(['2021-01-31'], dtype='datetime64[ns]', freq='M')

Now we explode our 'schedule' and allocate payments equally, and do some cleanup on column names etc:

df3 = df2.groupby(['cust_id','sub_id'], as_index = False).apply(lambda g: g.explode('schedule'))
(df3.groupby(['cust_id','sub_id'], as_index = False)
    .apply(lambda g: g.assign(sum = g['sum']/len(g)))
    .reset_index(drop = False)
    .drop(columns = ['min','max','level_0'])
    .rename(columns = {'sum':'payment'})
)

to get

      cust_id  sub_id      payment  schedule
--  ---------  --------  ---------  -------------------
 0          1  A           116.667  2020-12-31 00:00:00
 1          1  A           116.667  2021-01-31 00:00:00
 2          1  A           116.667  2021-02-28 00:00:00
 3          1  A           116.667  2021-03-31 00:00:00
 4          1  A           116.667  2021-04-30 00:00:00
 5          1  A           116.667  2021-05-31 00:00:00
 6          1  B            60      2021-01-31 00:00:00
 7          1  B            60      2021-02-28 00:00:00
 8          1  B            60      2021-03-31 00:00:00
 9          1  B            60      2021-04-30 00:00:00
10          2  C           500      2021-01-31 00:00:00

Thank you very much for your answer. I'll try this. Anyways, can you also help me with a related question? Here's the link: https://stackoverflow.com/q/66916016/15532072 — Yam, Apr 02 '21 at 08:44

score 1 · Accepted Answer · answered Apr 01 '21 at 23:28

This can be done in just a couple of steps using the resample() and transform() functions:

First, we add the missing months to the original table, changing all date values to the first of the month, combining rows for the same month with the original values of payment added, and putting 0's in the payment column in new rows:

resampled_df = (df
   .set_index('date')
   .groupby(['cust_id', 'sub_id'])
   .resample('MS')
   .agg({'payment': sum})
   .reset_index()
)

Then, we calculate the average across all months for each group and assign that average to every row in the group, assigning the result to a new column:

resampled_df['avg_monthly_payment'] = (resampled_df
   .groupby(['cust_id', 'sub_id'])['payment']
   .transform('mean')
)

Hey Alex! Can you also help me with another question? It's related with this but includes another dataframe and conditions. Here's the link: https://stackoverflow.com/q/66916016/15532072 — Yam, Apr 02 '21 at 08:42

Finding the average of values in a column and create a new dataframe that distributes the average

2 Answers2