
Assuming I have a pandas dataframe like this:

```
Date        Val1  Val2  Val 2 Greater than prev
2020-11-13  4     5     NaN
2020-11-14  6     9     yes
```

What udf could I write that would correctly fill in the `Val 2 Greater than prev` column?

I know how to solve this with a for loop and index access, but I assume a udf is easier.
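Since the question frames the data as a pandas dataframe, the row-over-previous-row comparison can also be done without a loop via `shift`. A minimal sketch, assuming the column names from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-11-13", "2020-11-14"],
    "Val1": [4, 6],
    "Val2": [5, 9],
})

prev = df["Val2"].shift()  # previous row's Val2; NaN on the first row
# Compare each Val2 with the previous row's; mask the first row back to
# NaN (it has no predecessor), matching the table above.
df["Val 2 Greater than prev"] = (df["Val2"] > prev).where(prev.notna())
```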

mck
ayrnee
  • you don't need a udf here. A plain comparison between Val2 and Val1 should be enough with a `when`, something like `F.when(F.col("Val2") > F.col("Val1"), "yes")` <- not tested but should be something similar – anky Jan 31 '21 at 02:06
  • @anky I incorrectly stated my question and have edited. I meant to say that I need a function that returns when the current ROW is greater than the previous row, not column. – ayrnee Jan 31 '21 at 02:12
    then one can use the `lead` or `lag` function for comparison. See [pyspark, Compare two rows in dataframe](https://stackoverflow.com/questions/38229659/pyspark-compare-two-rows-in-dataframe) and [Compare Value of Current and Previous Row in Spark](https://stackoverflow.com/questions/46197571/compare-value-of-current-and-previous-row-in-spark) – anky Jan 31 '21 at 02:16

1 Answer


You can use the window function `lag` to compare `Val2` with the previous row:

```python
from pyspark.sql import functions as F, Window

df.show()
# +----------+----+----+
# |      Date|Val1|Val2|
# +----------+----+----+
# |2020-11-13|   4|   5|
# |2020-11-14|   6|   9|
# +----------+----+----+

df2 = df.withColumn(
    'Val2_greater_than_prev',
    F.col('Val2') > F.lag('Val2').over(Window.orderBy('Date'))
)

df2.show()
# +----------+----+----+----------------------+
# |      Date|Val1|Val2|Val2_greater_than_prev|
# +----------+----+----+----------------------+
# |2020-11-13|   4|   5|                  null|
# |2020-11-14|   6|   9|                  true|
# +----------+----+----+----------------------+
```
mck