First of all, you need a column that defines the row order. Spark does not guarantee any ordering on its own: rows can come back in any position unless there is a column that states their exact order. I have added the column "order".
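If your real data has no such column but does contain something orderable, you can derive one. This is only a sketch; it assumes a DataFrame with a hypothetical Fill_date column that reflects the intended order:

from pyspark.sql import functions as F, Window as W

# Hypothetical: derive an "order" column per customer from a Fill_date column
w = W.partitionBy('Cust_ID').orderBy('Fill_date')
df = df.withColumn('order', F.row_number().over(w))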
Secondly, a column cannot reference its own previous values directly, so a workaround is needed. The following collects the data into a list for every "Cust_ID", performs the running calculation only within that list, and, when done, explodes the result back into rows using inline.
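As a minimal, self-contained illustration of that final step (the values here are made up, not taken from the data below), inline turns an array of structs into rows and columns:

spark.range(1).selectExpr(
    "inline(array(struct(1 order, 5 Fill_days), struct(2 order, 2 Fill_days)))"
).show()
# +-----+---------+
# |order|Fill_days|
# +-----+---------+
# |    1|        5|
# |    2|        2|
# +-----+---------+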
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, 5),
     (1, 2, 2),
     (1, 3, 1),
     (1, 4, -9),
     (1, 5, -2),
     (1, 6, 9),
     (5, 1, -2),
     (5, 2, 1),
     (5, 3, -1)],
    ['Cust_ID', 'order', 'Fill_days'])
Script:
df = df.groupBy('Cust_ID').agg(
    F.aggregate(
        # sorted array of (order, Fill_days) structs for each customer
        F.array_sort(F.collect_list(F.struct('order', 'Fill_days'))),
        # seed accumulator: one dummy struct, removed later with slice
        F.expr("array(struct(bigint(null) order, 0L Fill_days, 0L Adjusted_Days))"),
        # append the current struct (array_union works as an append here because
        # every struct is unique thanks to 'order'), setting Adjusted_Days to
        # max(0, previous Fill_days + previous Adjusted_Days)
        lambda acc, x: F.array_union(
            acc,
            F.array(x.withField(
                'Adjusted_Days',
                F.greatest(F.lit(0), F.element_at(acc, -1)['Fill_days'] + F.element_at(acc, -1)['Adjusted_Days'])
            ))
        )
    ).alias('a')
)
# drop the dummy seed element and explode the structs back into rows
df = df.selectExpr("Cust_ID", "inline(slice(a, 2, size(a)))")
df.show(truncate=False)
# +-------+-----+---------+-------------+
# |Cust_ID|order|Fill_days|Adjusted_Days|
# +-------+-----+---------+-------------+
# |1      |1    |5        |0            |
# |1      |2    |2        |5            |
# |1      |3    |1        |7            |
# |1      |4    |-9       |8            |
# |1      |5    |-2       |0            |
# |1      |6    |9        |0            |
# |5      |1    |-2       |0            |
# |5      |2    |1        |0            |
# |5      |3    |-1       |1            |
# +-------+-----+---------+-------------+
To understand it fully, please step through this answer carefully, as it is not an easy task to explain it a second time.
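If it helps, here is the same fold written as plain Python for one customer. It is only an illustration of the logic performed by aggregate, not part of the Spark script, and all names in it are mine:

rows = [(1, 5), (2, 2), (3, 1), (4, -9), (5, -2), (6, 9)]  # (order, Fill_days), already sorted

acc = [(None, 0, 0)]  # seed: (order, Fill_days, Adjusted_Days), dropped at the end
for order, fill_days in rows:
    _, prev_fill, prev_adjusted = acc[-1]
    adjusted = max(0, prev_fill + prev_adjusted)  # carry-over from the previous row, floored at zero
    acc.append((order, fill_days, adjusted))

print(acc[1:])  # [(1, 5, 0), (2, 2, 5), (3, 1, 7), (4, -9, 8), (5, -2, 0), (6, 9, 0)]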