I have a PySpark dataframe with 5 columns: Id, a value X, the lower and upper bounds of X (LB and UB), and the update date (the dataframe is ordered by Id and Update). I read it from a Hive table:
df = spark.sql("SELECT * FROM table1 ORDER BY Id, Update")
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-15|2019-01-25|2019-01-05|
| 1|2019-01-12|2019-01-15|2019-01-25|2019-01-07|
| 1|2019-01-15|2019-01-15|2019-01-25|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
+---+----------+----------+----------+----------+
When "X" is lower than "LB" or greater than "UB", "LB" & "UB" will be re-computed according to X and for all the following rows having the same Id.
if(X<LB | X>UB) LB = X-5 (in days)
UB = X+5 (in days)
The result should be like that:
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-05|2019-01-15|2019-01-05|
| 1|2019-01-12|2019-01-05|2019-01-15|2019-01-07|
| 1|2019-01-15|2019-01-05|2019-01-15|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
+---+----------+----------+----------+----------+
The third, fourth, and fifth rows have changed. How can I achieve this?
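One direction I am considering (I'm not sure it is the right way) is to treat each Id as a group and carry the running bounds forward row by row with groupBy().applyInPandas, which requires Spark >= 3.0. The sketch below assumes X, LB and UB are DateType columns; the 5-day shift and the column names come from the example above.

import datetime

import pandas as pd

def recompute_bounds(pdf: pd.DataFrame) -> pd.DataFrame:
    # Handle one Id at a time, in Update order, carrying the current bounds forward.
    pdf = pdf.sort_values("Update").reset_index(drop=True)
    # Start from the bounds given on the first row of the group.
    lb = pdf.loc[0, "LB"]
    ub = pdf.loc[0, "UB"]
    for i in range(len(pdf)):
        x = pdf.loc[i, "X"]
        if x < lb or x > ub:
            # X fell outside the current window: re-center it on X (+/- 5 days)
            lb = x - datetime.timedelta(days=5)
            ub = x + datetime.timedelta(days=5)
        pdf.loc[i, "LB"] = lb
        pdf.loc[i, "UB"] = ub
    return pdf

result = df.groupBy("Id").applyInPandas(recompute_bounds, schema=df.schema)
result.orderBy("Id", "Update").show()

Is iterating inside a pandas UDF like this acceptable, or is there a way to express this "recompute and carry forward" logic with window functions instead?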