I have a dataframe with cards, time and amount and I need to aggregate card's amount (sum and count) with a one month window.
Here are how data looks like :
+--------------------+-------------------+------------+
| card_uid| date|amount_local|
+--------------------+-------------------+------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30| 8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18| 16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57| 16.19|
|card_003STfrgB8SZ...|2016-12-04 10:05:21| 58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25| 27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29| 12.99|
This is what I've made so far.
+--------------------+-------------------+------------+----------------+
| card_uid| date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30| 8.99| 8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18| 16.19| 16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57| 16.19| 32.38|
|card_003STfrgB8SZ...|2016-12-04 10:05:21| 58.8| 58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25| 27.95| 27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29| 12.99| 40.94|
With window functions below :
partition = Window.partitionBy("card_uid").orderBy("date")
previousTransactionDate = data.withColumn("previous_tr_time", lag(data.date).over(partition)).select("transaction_id", "card_uid", "date", "previous_tr_time")
df_cum_sum = data.withColumn("duration_cum_sum", sum('amount_local').over(partition))
df_cum_sum.orderBy("card_uid","date").select("card_uid", "date", "amount_local", "duration_cum_sum").show()
But the only thing I want to add is two things :
- Aggregate the same way, only if date are inferior to one month
- Put zero instead of the same amount for the cum_sum
So the needed ouput looks like this :
+--------------------+-------------------+------------+----------------+
| card_uid| date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30| 8.99| 0|
|card_0026uGZQwZQd...|2016-05-06 12:16:18| 16.19| 0|
|card_0026uGZQwZQd...|2016-05-12 12:17:57| 4.00| 16.19|
|card_0026uGZQwZQd...|2016-06-06 12:23:51| 16.19| 4.00| => Only 4 because de 16.19 was more than one month ago
|card_003STfrgB8SZ...|2016-12-04 10:05:21| 58.8| 0|
|card_005gBxyiDc6b...|2016-09-10 18:58:25| 27.95| 0|
|card_005gBxyiDc6b...|2016-09-12 11:18:29| 12.99| 27.95| => Previous amount
|card_005gBxyiDc6b...|2016-09-22 14:25:44| 23.99| 40.94| => 27.95 + 12.99
I can't groupBy card_uid since I need the same number of rows as the original to link to another table