
I have a dataframe with card, time, and amount columns, and I need to aggregate each card's amount (sum and count) over a one-month window.

Here is what the data looks like:

+--------------------+-------------------+------------+
|            card_uid|               date|amount_local|
+--------------------+-------------------+------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|
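
For reproducibility, here is a minimal sketch that builds this sample data (the card_uid values are shortened stand-ins for the real ones, and the real data also has a transaction_id column used below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows mirroring the table above; dates are strings here and get
# cast to timestamps later on
data = spark.createDataFrame([
    ("card_001H4Mw1Ha0M", "2016-05-04 17:54:30", 8.99),
    ("card_0026uGZQwZQd", "2016-05-06 12:16:18", 16.19),
    ("card_0026uGZQwZQd", "2016-07-06 12:17:57", 16.19),
    ("card_003STfrgB8SZ", "2016-12-04 10:05:21", 58.80),
    ("card_005gBxyiDc6b", "2016-09-10 18:58:25", 27.95),
    ("card_005gBxyiDc6b", "2016-11-12 11:18:29", 12.99),
], ["card_uid", "date", "amount_local"])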

This is what I have produced so far:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|            8.99|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|           16.19|
|card_0026uGZQwZQd...|2016-07-06 12:17:57|       16.19|           32.38|
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|            58.8|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|           27.95|
|card_005gBxyiDc6b...|2016-11-12 11:18:29|       12.99|           40.94|

using the window functions below:

from pyspark.sql import Window
from pyspark.sql.functions import lag, sum

# Running (cumulative) sum of amount_local per card, ordered by date
partition = Window.partitionBy("card_uid").orderBy("date")

previousTransactionDate = data.withColumn("previous_tr_time", lag(data.date).over(partition)) \
                              .select("transaction_id", "card_uid", "date", "previous_tr_time")

df_cum_sum = data.withColumn("duration_cum_sum", sum("amount_local").over(partition))

df_cum_sum.orderBy("card_uid", "date").select("card_uid", "date", "amount_local", "duration_cum_sum").show()

There are only two things I want to change:

  • Aggregate the same way, but only over transactions less than one month before the current one
  • Use zero instead of the row's own amount for the cum_sum (i.e., exclude the current transaction)

So the desired output looks like this:

+--------------------+-------------------+------------+----------------+
|            card_uid|               date|amount_local|duration_cum_sum|
+--------------------+-------------------+------------+----------------+
|card_001H4Mw1Ha0M...|2016-05-04 17:54:30|        8.99|               0|
|card_0026uGZQwZQd...|2016-05-06 12:16:18|       16.19|               0|
|card_0026uGZQwZQd...|2016-05-12 12:17:57|        4.00|           16.19|
|card_0026uGZQwZQd...|2016-06-06 12:23:51|       16.19|            4.00| => Only 4 because the 16.19 was more than one month ago
|card_003STfrgB8SZ...|2016-12-04 10:05:21|        58.8|               0|
|card_005gBxyiDc6b...|2016-09-10 18:58:25|       27.95|               0|
|card_005gBxyiDc6b...|2016-09-12 11:18:29|       12.99|           27.95| => Previous amount 
|card_005gBxyiDc6b...|2016-09-22 14:25:44|       23.99|           40.94| => 27.95 + 12.99

I can't groupBy card_uid, since I need the same number of rows as the original in order to join to another table.

LaSul
  • Possible duplicate of [How to aggregate over rolling time window with groups in Spark](https://stackoverflow.com/questions/41711716/how-to-aggregate-over-rolling-time-window-with-groups-in-spark) – 10465355 Jan 10 '19 at 17:40
  • Well, this is not working, as I need a cumulative sum without going through groupBy. This can't help me because I can't get the link to the previous transaction id – LaSul Jan 10 '19 at 18:12
  • Please cancel the duplicate – LaSul Jan 10 '19 at 18:16

1 Answer


You need a rolling window over date, ranging from 30 days before the current row up to one day before it. Since interval expressions are not supported in a window's range specification, you can cast the dates to long (seconds since the epoch) and express the range bounds in seconds.

from pyspark.sql import Window
from pyspark.sql.functions import col, sum

# One day expressed in seconds, for use with rangeBetween on a long column
days = lambda i: i * 86400

# Frame spans from 30 days before the current row's timestamp up to 1 day
# before it, so the current transaction itself is excluded
partition = Window.partitionBy("card_uid") \
                  .orderBy(col("date").cast("timestamp").cast("long")) \
                  .rangeBetween(days(-30), days(-1))

# sum over an empty frame is null, hence the fillna(0)
df_cum_sum = data.withColumn("duration_cum_sum", sum(col("amount_local")).over(partition)) \
                 .fillna(0, subset=["duration_cum_sum"])
df_cum_sum.show()
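
The question also asks for a count. Assuming the same one-month frame applies, a rolling count can be layered onto the same window; count over an empty frame returns 0, so no fillna is needed there:

from pyspark.sql.functions import count

# Number of transactions for the same card in the preceding 30-day frame
df_agg = df_cum_sum.withColumn("duration_cum_count", count(col("amount_local")).over(partition))
df_agg.show()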
Manoj Singh
  • Thanks!! It works like I want! :) I just changed days(-30) to days(-31) because it was missing the last day, but this is perfect – LaSul Jan 11 '19 at 08:44
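
Following that comment, the widened frame (a boundary adjustment from the comment, not part of the original answer) would be:

# Widened lower bound per the comment above: 31 days back instead of 30,
# so transactions falling just over the 30-day mark are still counted
partition = Window.partitionBy("card_uid") \
                  .orderBy(col("date").cast("timestamp").cast("long")) \
                  .rangeBetween(days(-31), days(-1))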