I am reviewing some code and would like a bit more clarity.
Here is my PySpark DataFrame:
| YEAR_A | YEAR_B | AMOUNT |
|--------|--------|--------|
| 2000   | 2001   | 5      |
| 2000   | 2000   | 4      |
| 2000   | 2001   | 3      |
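For reproducibility, this is roughly how that DataFrame can be built (I'm assuming a local SparkSession here; apart from the column names, everything in this snippet is just for illustration):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F  # Window and F are used in the snippets below

# assumed: a local session, only to reproduce the example
spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(2000, 2001, 5), (2000, 2000, 4), (2000, 2001, 3)],
    ["YEAR_A", "YEAR_B", "AMOUNT"],
)
```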
I define a window specification:
window = Window.partitionBy('YEAR_A')
Then I would love some help understanding the following part, especially what happens after over(window):
df = (df.withColumn("newcolumn", F.sum("AMOUNT").over(window) * (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer")))
Is it supposed to add a "newcolumn" to my DataFrame containing the sum of "AMOUNT" for the current YEAR_A partition, written only where "YEAR_A" equals "YEAR_B" (and otherwise NaN)? Or am I missing something?
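To make the question concrete, here is how I would split that expression into intermediate columns in order to inspect it (just a sketch on the DataFrame above; the helper column names partition_sum and match_flag are mine, not part of the code under review):

```python
window = Window.partitionBy("YEAR_A")  # same window spec as above

df_debug = (
    df
    # windowed sum of AMOUNT over all rows that share the same YEAR_A
    .withColumn("partition_sum", F.sum("AMOUNT").over(window))
    # the comparison yields a boolean, cast to 1 where YEAR_B == YEAR_A and 0 otherwise
    .withColumn("match_flag", (F.col("YEAR_B") == F.col("YEAR_A")).cast("integer"))
    # the original expression is the product of the two columns above
    .withColumn("newcolumn", F.col("partition_sum") * F.col("match_flag"))
)
df_debug.show()
```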