Finding the average of the unequal values per month and distribute it based on some conditions

Question

I am currently struggling with converting my data into a useful dataset. I need to evenly distribute payments from the first month up to the last month. The problem is that payments are inconsistent and unequal. Also, there are payments that have been fully paid and should be distributed from the first payment plus the term it is applicable based on the agreement dataframe.

My tables are the following:

1st table: payments

cust_id	agreement_id	date	payment
1	A	12/1/20	200
1	A	2/2/21	200
1	A	2/3/21	100
1	A	5/1/21	200
1	B	1/2/21	50
1	B	1/9/21	20
1	B	3/1/21	80
1	B	4/23/21	90
2	C	1/21/21	600
3	D	3/4/21	150
3	D	5/3/21	150

Here's the code for the payments dataframe:

payments = pd.DataFrame.from_dict({'cust_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 3, 10: 3}, 
                               'agreement_id': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B',
                                                8: 'C', 9: 'D', 10: 'D'},
                               'date': {0: '12/1/20', 1: '2/2/21', 2: '2/3/21', 3: '5/1/21', 4: '1/2/21', 5: '1/9/21',
                                        6: '3/1/21', 7: '4/23/21', 8: '1/21/21', 9: '3/4/21', 10: '5/3/21'},
                               'payment': {0: 200, 1: 200, 2: 100, 3: 200, 4: 50, 5: 20, 6: 80, 7: 90, 8: 600, 9: 150, 10: 150}})

2nd table: agreement

agreement_id	activation	term_months	total_fee
A	12/1/20	24	4800
B	1/21/21	6	600
C	1/21/21	6	600
D	3/4/21	6	300

Here's the code for the agreement dataframe:

agreement = pd.DataFrame.from_dict({'agreement_id': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'activation': {0: '12/1/20', 1: '1/2/21', 2: '1/21/21', 3: '3/4/21'}, 'term_months': {0: 24, 1: 6, 2: 6, 3: 6}, 'total_fee': {0: 4800, 1: 300, 2: 600, 3: 300}})

The results I wanted is as follows:

cust_id	agreement_id	date	payment
1	A	12/1/20	116.67
1	A	1/1/21	116.67
1	A	2/1/21	116.67
1	A	3/1/21	116.67
1	A	4/1/21	116.67
1	A	5/1/21	116.67
1	B	1/1/21	60
1	B	2/1/21	60
1	B	3/1/21	60
1	B	4/1/21	60
2	C	1/1/21	100
2	C	2/1/21	100
2	C	3/1/21	100
2	C	4/1/21	100
2	C	5/1/21	100
2	C	6/1/21	100
3	D	3/1/21	50
3	D	4/1/21	50
3	D	5/1/21	50
3	D	6/1/21	50
3	D	7/1/21	50
3	D	8/1/21	50

Or, in code form:

    cust_id agreement_id     date  payment
0         1            A  12/1/20   116.67
1         1            A   1/1/21   116.67
2         1            A   2/1/21   116.67
3         1            A   3/1/21   116.67
4         1            A   4/1/21   116.67
5         1            A   5/1/21   116.67
6         1            B   1/1/21    60.00
7         1            B   2/1/21    60.00
8         1            B   3/1/21    60.00
9         1            B   4/1/21    60.00
10        2            C   1/1/21   100.00
11        2            C   2/1/21   100.00
12        2            C   3/1/21   100.00
13        2            C   4/1/21   100.00
14        2            C   5/1/21   100.00
15        2            C   6/1/21   100.00
16        3            D   3/1/21    50.00
17        3            D   4/1/21    50.00
18        3            D   5/1/21    50.00
19        3            D   6/1/21    50.00
20        3            D   7/1/21    50.00
21        3            D   8/1/21    50.00

Activation is the same with the date of first payment.

I tried using the following code (suggested by AlexK) that creates another column but it is only applicable when the total payments are less than the total fee. But, when the total payments are equal to the total fee, I need to distribute the payments accordingly from the start of the payment until the end of the month (start plus terms in months).

payments['date'] = pd.to_datetime(payments['date'])
resampled_payments = (payments
   .set_index('date')
   .groupby(['cust_id', 'agreement_id'])
   .resample('MS')
   .agg({'payment': sum})
   .reset_index()
)

resampled_payments['avg_monthly_payment'] = (resampled_payments
   .groupby(['cust_id', 'agreement_id'])['payment']
   .transform('mean')
)

Hi! Thanks for the question. The condition here is that when the total payments for the agreement_id is less than the total_fee shown in the agreement table, the payments will be spread on all months (monthly) from the start of the first payment until the last payment. But, if payments have already equalled the total_fee, they will be distributed evenly from the first payment until 'date of first payment plus the term in months'. — Yam, Apr 02 '21 at 13:27

AnilGoyal · Answer 1 · 2021-04-02T14:20:29.170

Here is R solution (since you've tagged it with R as well)

#load libraries
library(tidyverse)
library(lubridate)

pymts <- read.table(text = "cust_id agreement_id    date    payment
1   A   12/1/20 200
1   A   2/2/21  200
1   A   2/3/21  100
1   A   5/1/21  200
1   B   1/2/21  50
1   B   1/9/21  20
1   B   3/1/21  80
1   B   4/23/21 90
2   C   1/21/21 600
3   D   3/4/21  150
3   D   5/3/21  150", header = T)

agmt <- read.table(text = "agreement_id activation  term_months total_fee
A   12/1/20 24  4800
B   1/21/21 6   600
C   1/21/21 6   600
D   3/4/21  6   300", header = T)

#final code

final<- pymts %>% mutate(date = as.Date(date, "%m/%d/%y")) %>%
  left_join(agmt %>% mutate(activation = as.Date(activation, "%m/%d/%y")), by = "agreement_id") %>%
  group_by(cust_id, agreement_id) %>%
  mutate(d = n(),
         date = floor_date(date, "month")) %>%
  complete(date = seq.Date(from = min(date), by = "month", length.out = ifelse(sum(payment) == first(total_fee),
                                                                                                    first(term_months), 
                                                                                                   (year(max(date)) -
                                                                                                      year(min(date)))*12 +
                                                                                                      month(max(date)) - 
                                                                                                      month(min(date)) +1))) %>%
  mutate(payment = sum(payment, na.rm = T)) %>%
  filter(!duplicated(date)) %>%
  mutate(payment = payment/n()) %>%
  select(1:4) %>% ungroup()


final
# A tibble: 22 x 4
   cust_id agreement_id date       payment
     <int> <chr>        <date>       <dbl>
 1       1 A            2020-12-01    117.
 2       1 A            2021-01-01    117.
 3       1 A            2021-02-01    117.
 4       1 A            2021-03-01    117.
 5       1 A            2021-04-01    117.
 6       1 A            2021-05-01    117.
 7       1 B            2021-01-01     60 
 8       1 B            2021-02-01     60 
 9       1 B            2021-03-01     60 
10       1 B            2021-04-01     60 
# ... with 12 more rows

I tried this and it works. I hope you can do a python version or someone else does. Thank you very much! — Yam, Apr 02 '21 at 15:36
I don't know python much but I can explain the logic, if you want. — AnilGoyal, Apr 02 '21 at 15:58

Andrea Dalseno · Accepted Answer · 2021-04-07T19:48:05.987

Given your dataframes this should work

from dateutil.relativedelta import relativedelta

# Transofrm column to date
payments['date']= pd.to_datetime(payments['date'])
agreement['activation']= pd.to_datetime(agreement['activation'])

final =pd.merge(payments,agreement,on='agreement_id',how='left')

# set date to beginning of month
final['date'] = pd.to_datetime(final.date).dt.to_period('M').dt.to_timestamp()


def set_date_range(df):
    if df['payment'].sum() == df['total_fee'].iloc[0]:
        return pd.date_range(min(g['date']), periods=df['term_months'].iloc[0], freq='M')
    else:
        return pd.date_range(min(g['date']),
                             max(g['date'])+relativedelta(months=+1), freq='M' )

# Create dataframe with dates
seq_df = pd.DataFrame()
for i,g in final.groupby(['cust_id', 'agreement_id']):
    seq_df = pd.concat([seq_df,
                        pd.DataFrame({'cust_id': i[0], 'agreement_id': i[1],  'date': set_date_range(g)})])

# Set date to beginnig of month
seq_df['date'] = pd.to_datetime(seq_df.date).dt.to_period('M').dt.to_timestamp()

final = (pd.concat([final, seq_df], sort=True)
              .sort_values(['cust_id', 'agreement_id', 'date'])
              .reset_index(drop=True)
              .reindex(columns=final.columns))

final['payment'] = final.groupby(by=['cust_id', 'agreement_id'])["payment"].transform("sum")

final = final.drop_duplicates(['cust_id', 'agreement_id', 'date'])

final['n'] = final.groupby(by=['cust_id', 'agreement_id'])["cust_id"].transform("count")
final['payment_due'] = final['payment']/final['n']
final[['cust_id','agreement_id','date', 'payment_due']]

I haven't been able to exactly replicate the pipe form tidyverse but the output should match. The hardest part has been the creation of seq_df but it should be fine (double test it for more general use case)

Finding the average of the unequal values per month and distribute it based on some conditions

2 Answers2

Linked