
I have a PySpark dataframe with 5 columns: Id, a value X, the lower and upper bounds of X, and the update date (the dataframe is ordered by "Id, Update"). I read it from a Hive table:

    spark.sql("SELECT * FROM table1 ORDER BY Update")

    +---+----------+----------+----------+----------+
    | Id|         X|        LB|        UB|    Update|
    +---+----------+----------+----------+----------+
    |  1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
    |  1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
    |  1|2019-01-10|2019-01-15|2019-01-25|2019-01-05|
    |  1|2019-01-12|2019-01-15|2019-01-25|2019-01-07|
    |  1|2019-01-15|2019-01-15|2019-01-25|2019-01-08|
    |  2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
    |  2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|

When "X" is lower than "LB" or greater than "UB", "LB" and "UB" must be re-computed from X, both for that row and for all the following rows with the same Id:

    if(X<LB | X>UB) LB = X-5 (in days)
                    UB = X+5 (in days)
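To make the rule concrete, here is a minimal pure-Python sketch of the bounds check for a single row (the function name and the `margin_days` parameter are illustrative, not part of the original table):

```python
from datetime import date, timedelta

def update_bounds(x, lb, ub, margin_days=5):
    """Recompute the bounds when x falls outside [lb, ub]."""
    if x < lb or x > ub:
        delta = timedelta(days=margin_days)
        return x - delta, x + delta
    return lb, ub

# Third row of Id 1: X = 2019-01-10 is below LB = 2019-01-15,
# so the bounds are recentred to X-5 and X+5 days.
lb, ub = update_bounds(date(2019, 1, 10), date(2019, 1, 15), date(2019, 1, 25))
# lb = 2019-01-05, ub = 2019-01-15
```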

The result should look like this:

    +---+----------+----------+----------+----------+
    | Id|         X|        LB|        UB|    Update|
    +---+----------+----------+----------+----------+
    |  1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
    |  1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
    |  1|2019-01-10|2019-01-05|2019-01-15|2019-01-05|
    |  1|2019-01-12|2019-01-05|2019-01-15|2019-01-07|
    |  1|2019-01-15|2019-01-05|2019-01-15|2019-01-08|
    |  2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
    |  2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|

The third, fourth and fifth rows are changed. How can I achieve this?

Cecile

1 Answer


Try a CASE statement within a select expression:

    df.selectExpr("Id",
                  "X",
                  "CASE WHEN X < LB OR X > UB THEN date_sub(X, 5) ELSE LB END AS LB",
                  "CASE WHEN X < LB OR X > UB THEN date_add(X, 5) ELSE UB END AS UB",
                  "Update").show()
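One caveat: this recomputes each row independently against its stored bounds, whereas the question also asks that recomputed bounds carry forward to the following rows of the same Id (in the expected output, rows 4 and 5 keep the bounds set by row 3). That carry-forward is sequential, so it needs a per-Id fold over the rows ordered by Update. A minimal pure-Python sketch of that fold (function and parameter names are illustrative; in Spark it could be run per group, e.g. via a grouped-map such as `applyInPandas`):

```python
from datetime import date, timedelta

def propagate_bounds(rows, margin_days=5):
    """Fold over one Id's rows, already ordered by Update.

    rows: list of (x, lb, ub) date triples.
    Carries the current bounds forward and recentres them around X
    whenever X falls outside the current [lb, ub] interval.
    """
    out, lb, ub = [], None, None
    for x, row_lb, row_ub in rows:
        if lb is None:            # first row: start from the stored bounds
            lb, ub = row_lb, row_ub
        if x < lb or x > ub:      # out of bounds: recompute and carry forward
            delta = timedelta(days=margin_days)
            lb, ub = x - delta, x + delta
        out.append((x, lb, ub))
    return out
```

On the Id 1 rows from the question, this reproduces the expected output: row 3 recentres the bounds to [2019-01-05, 2019-01-15], and rows 4 and 5 then fall inside the new interval, so they keep it.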
Shantanu Sharma