I am reviewing some code and would like a bit more clarity.
Here is my PySpark DataFrame:
| YEAR_A | YEAR_B | AMOUNT |
|--------|--------|--------|
| 2000   | 2001   | 5      |
| 2000   | 2000   | 4      |
| 2000   | 2001   | 3      |
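For reproducibility, this is roughly how that DataFrame can be built (I'm assuming a local SparkSession here; apart from the column names, everything in this snippet is just for illustration):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F  # Window and F are used in the snippets below

# assumed: a local session, only to reproduce the example
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(2000, 2001, 5), (2000, 2000, 4), (2000, 2001, 3)],
    ["YEAR_A", "YEAR_B", "AMOUNT"],
)
```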
I define a window specification:
window = Window.partitionBy('YEAR_A')
Then I would love some help understanding the following part, especially what happens after over(window):
df = (df.withColumn("newcolumn", F.sum("AMOUNT").over(window) * (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer")))
Is it supposed to add a "newcolumn" to my DataFrame containing the sum of "AMOUNT" for the current YEAR_A partition, written only where "YEAR_A" equals "YEAR_B" (and otherwise NaN)? Or am I missing something?
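To make the question concrete, here is how I would split that expression into intermediate columns in order to inspect it (just a sketch on the DataFrame above; the helper column names partition_sum and match_flag are mine, not part of the code under review):

```python
window = Window.partitionBy("YEAR_A")  # same window spec as above

df_debug = (
    df
    # windowed sum of AMOUNT over all rows that share the same YEAR_A
    .withColumn("partition_sum", F.sum("AMOUNT").over(window))
    # the comparison yields a boolean, cast to 1 where YEAR_B == YEAR_A and 0 otherwise
    .withColumn("match_flag", (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer"))
    # the original expression is the product of the two columns above
    .withColumn("newcolumn", F.col("partition_sum") * F.col("match_flag"))
)
df_debug.show()
```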