
Assuming I have a pandas dataframe like this:

```
Date        Val1  Val2  Val 2 Greater than prev
2020-11-13  4     5     NaN
2020-11-14  6     9     yes
```

What udf could I write that would correctly fill in the `Val 2 Greater than prev` column?

I know how to solve this with a for loop and index access, but I assume a udf is easier.
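Since the question frames the data as a pandas dataframe, the row-over-previous-row comparison can also be done without a loop via `shift`. A minimal sketch, assuming the column names from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-11-13", "2020-11-14"],
    "Val1": [4, 6],
    "Val2": [5, 9],
})

prev = df["Val2"].shift()  # previous row's Val2; NaN on the first row
# Compare each Val2 with the previous row's; mask the first row back to
# NaN (it has no predecessor), matching the table above.
df["Val 2 Greater than prev"] = (df["Val2"] > prev).where(prev.notna())
```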

mck
ayrnee
  • you don't need a udf here. A plain comparison between Val2 and Val1 should be enough with a `when`, something like `F.when(F.col("Val2") > F.col("Val1"), "yes")` <- not tested but should be something similar – anky Jan 31 '21 at 02:06
  • @anky I incorrectly stated my question and have edited. I meant to say that I need a function that returns when the current ROW is greater than the previous row, not column. – ayrnee Jan 31 '21 at 02:12
    then one can use the `lead` or `lag` function for comparison. See [pyspark, Compare two rows in dataframe](https://stackoverflow.com/questions/38229659/pyspark-compare-two-rows-in-dataframe) and [Compare Value of Current and Previous Row in Spark](https://stackoverflow.com/questions/46197571/compare-value-of-current-and-previous-row-in-spark) – anky Jan 31 '21 at 02:16

1 Answer


You can use the window function `lag` to compare `Val2` with the previous row:

```python
from pyspark.sql import functions as F, Window

df.show()
# +----------+----+----+
# |      Date|Val1|Val2|
# +----------+----+----+
# |2020-11-13|   4|   5|
# |2020-11-14|   6|   9|
# +----------+----+----+

df2 = df.withColumn(
    'Val2_greater_than_prev',
    F.col('Val2') > F.lag('Val2').over(Window.orderBy('Date'))
)

df2.show()
# +----------+----+----+----------------------+
# |      Date|Val1|Val2|Val2_greater_than_prev|
# +----------+----+----+----------------------+
# |2020-11-13|   4|   5|                  null|
# |2020-11-14|   6|   9|                  true|
# +----------+----+----+----------------------+
```
mck