I have a PySpark dataframe like this:
+--------+-------------+--------------+-----------------------+
|material|purchase_date|mkt_prc_usd_lb|min_mkt_prc_over_1month|
+--------+-------------+--------------+-----------------------+
| Copper| 2019-01-09| 2.6945| 2.6838|
| Copper| 2019-01-23| 2.6838| 2.6838|
| Zinc| 2019-01-23| 1.1829| 1.1829|
| Zinc| 2019-06-26| 1.1918| 1.1918|
|Aluminum| 2019-01-02| 0.8363| 0.8342|
|Aluminum| 2019-01-09| 0.8342| 0.8342|
|Aluminum| 2019-01-23| 0.8555| 0.8342|
|Aluminum| 2019-04-03| 0.8461| 0.8461|
+--------+-------------+--------------+-----------------------+
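For reference, a small reproducible frame with the three base columns could be built roughly like this (the fourth column is derived by the window below; the spark/df variable names are just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Copper", "2019-01-09", 2.6945),
     ("Copper", "2019-01-23", 2.6838),
     ("Zinc", "2019-01-23", 1.1829),
     ("Zinc", "2019-06-26", 1.1918),
     ("Aluminum", "2019-01-02", 0.8363),
     ("Aluminum", "2019-01-09", 0.8342),
     ("Aluminum", "2019-01-23", 0.8555),
     ("Aluminum", "2019-04-03", 0.8461)],
    ["material", "purchase_date", "mkt_prc_usd_lb"],
).withColumn("purchase_date", to_date("purchase_date"))  # parse the date strings into a date column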
The last column, 'min_mkt_prc_over_1month', is the minimum of 'mkt_prc_usd_lb' (the 3rd column) over a one-month window per material, i.e. from -15 days to +15 days around each purchase_date, using a window partitioned by material and ordered by purchase_date.
The code is:
from pyspark.sql import Window
from pyspark.sql.functions import col

days = lambda i: i * 86400  # assumed helper: days -> seconds, matching the order key cast to a long (unix seconds)
w2 = (Window
      .partitionBy("material")
      .orderBy(col("purchase_date").cast("timestamp").cast("long"))
      .rangeBetween(-days(15), days(15)))
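For context, the fourth column in the table above then comes from applying this window, presumably along these lines (df and w2 as above):

from pyspark.sql import functions as F

df = df.withColumn(
    "min_mkt_prc_over_1month",
    F.min("mkt_prc_usd_lb").over(w2),  # rolling minimum price over the +/-15 day window per material
)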
Now, I want to see the 'purchase_date' on which that minimum price occurs, whether it falls before or after the row's own purchase_date.
Expected output (from the first two rows):
+--------+-------------+--------------+-----------------------+------------------+
|material|purchase_date|mkt_prc_usd_lb|min_mkt_prc_over_1month|date_of_min_price |
+--------+-------------+--------------+-----------------------+------------------+
| Copper| 2019-01-09| 2.6945| 2.6838| 2019-01-23|
| Copper| 2019-01-23| 2.6838| 2.6838| 2019-01-23|
+--------+-------------+--------------+-----------------------+------------------+
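A minimal sketch of one way this might be expressed (not necessarily the intended answer) is to take the window minimum of a (price, date) struct over the same w2 window, which compares by price first, and then extract the date field; the new column name and the tie-breaking toward the earliest date are assumptions here:

from pyspark.sql import functions as F

df = df.withColumn(
    "date_of_min_price",
    # struct comparison is field-by-field, so the min struct carries the date of the lowest price
    F.min(F.struct("mkt_prc_usd_lb", "purchase_date")).over(w2).getField("purchase_date"),
)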