I frequently come across the use case where I have a (time-ordered) Spark dataframe and would like to know the differences between consecutive rows:
>>> df.show()
+-----+----------+----------+
|index| c1| c2|
+-----+----------+----------+
| 0.0|0.35735932|0.39612636|
| 1.0| 0.7279809|0.54678476|
| 2.0|0.68788993|0.25862947|
| 3.0| 0.645063| 0.7470685|
+-----+----------+----------+
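(For reference, a dataframe like this can be constructed directly; a minimal sketch, with the literal values copied from the output above:)
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     [(0.0, 0.35735932, 0.39612636),
...      (1.0, 0.7279809, 0.54678476),
...      (2.0, 0.68788993, 0.25862947),
...      (3.0, 0.645063, 0.7470685)],
...     ["index", "c1", "c2"])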
The question of how to do this has been asked before, though in narrower contexts:
pyspark, Compare two rows in dataframe
Date difference between consecutive rows - Pyspark Dataframe
However, I find the answers rather involved (see the sketch after this list):
- the Window class must be imported separately
- for some data types (e.g. datetimes) a cast must be done first
- only then can the rows be compared, using "lag"
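For concreteness, this is roughly what that involved approach looks like (a minimal sketch against the dataframe above; note that with lag the null lands on the first row, whereas my mock-up below puts it on the last):
>>> from pyspark.sql import Window
>>> import pyspark.sql.functions as f
>>> # a window over the whole frame, ordered by the time index
>>> # (no partitionBy, so Spark warns that all data moves to a single partition)
>>> w = Window.orderBy("index")
>>> # lag("c1") is the previous row's c1; the first row has no predecessor -> null
>>> df.withColumn("diff_c1", f.col("c1") - f.lag("c1").over(w)).show()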
It strikes me as odd that this cannot be done with a single API call, for example like so:
>>> import pyspark.sql.functions as f
>>> df.select(f.diffs(df.c1)).show()
+----------+
| diffs(c1)|
+----------+
|    0.3706|
|   -0.0401|
|   -0.0428|
|      null|
+----------+
What is the reason for this?
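(For completeness: the boilerplate can of course be wrapped in a small user-side helper. The diffs name, its signature, and the hard-coded ordering column below are my own assumptions, not part of the API:)
>>> from pyspark.sql import Window
>>> import pyspark.sql.functions as f
>>> def diffs(col_name, order_by="index"):
...     """Difference between each row's value and the previous row's."""
...     w = Window.orderBy(order_by)
...     return (f.col(col_name) - f.lag(col_name).over(w)).alias("diffs(" + col_name + ")")
...
>>> df.select(diffs("c1")).show()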