
I frequently come across the use case where I have a (time-ordered) Spark DataFrame and would like to know the differences between consecutive rows:

>>> df.show()
+-----+----------+----------+
|index|        c1|        c2|
+-----+----------+----------+
|  0.0|0.35735932|0.39612636|
|  1.0| 0.7279809|0.54678476|
|  2.0|0.68788993|0.25862947|
|  3.0|  0.645063| 0.7470685|
+-----+----------+----------+

The question on how to do this has been asked before in a narrower context:

pyspark, Compare two rows in dataframe

Date difference between consecutive rows - Pyspark Dataframe

However, I find the answers rather involved:

  • a separate module, `Window`, must be imported
  • for some data types (e.g. datetimes) a cast must be done
  • only then, using `lag`, can the rows finally be compared

It strikes me as odd that this cannot be done with a single API call, for example like so:

>>> import pyspark.sql.functions as f
>>> df.select(f.diffs(df.c1)).show()
+----------+
| diffs(c1)|
+----------+
|   0.3706 |
|  -0.0400 |
|  -0.0428 |
|     null |
+----------+

What is the reason for this?

Ytsen de Boer
In both linked questions you can find answers using the `lag` function: http://stackoverflow.com/a/38230813, http://stackoverflow.com/a/38159608. Can you imagine an API simpler than using one function? :-) – Mariusz Dec 23 '16 at 18:12

1 Answer


There are a few basic reasons:

  • In general, the distributed data structures used in Spark are not ordered. In particular, any lineage containing a shuffle phase / exchange can output a structure with non-deterministic order.

    As a result, when we talk about a Spark DataFrame we actually mean a relation, not a DataFrame as known from local libraries like Pandas, and without an explicit ordering, comparing consecutive rows is simply not meaningful.

  • Things get even fuzzier when you realize that the sorting methods used in Spark themselves use shuffles and are not stable.

  • Even if you ignore the possible non-determinism, handling partition boundaries is a bit involved and typically breaks lazy execution.

    In other words, you cannot access an element to the left of the first element of a given partition, or to the right of the last element of a given partition, without a shuffle, an additional action, or a separate data scan.

zero323