
I have a DataFrame that I want to transform into the output below, where each row's start_duration and end_duration are derived from the previous row's start_duration and end_duration. How can I achieve this in Spark using Scala?

Below are the formulas to calculate start_duration and end_duration:

start_duration = max(previous end_duration + 1, current date)
end_duration = min(prescription_end_date, start_duration + duration - 1)
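Because each row's start_duration depends on the *computed* end_duration of the previous row (not the raw previous row), a single window function pass cannot express this; a sequential fold per prescription can. Here is a minimal plain-Scala sketch of the recurrence using java.time, with hypothetical names (`Dispensation`, `computeDurations`) not taken from the question; in Spark this fold would run inside a per-group `flatMap` after a `groupBy` on prescription_uid:

```scala
import java.time.LocalDate

// Hypothetical row type: one dispensation, already sorted by date
// within a single prescription.
case class Dispensation(date: LocalDate, duration: Int)

def computeDurations(rows: Seq[Dispensation],
                     prescriptionEnd: LocalDate): Seq[(LocalDate, LocalDate)] =
  rows.foldLeft(Seq.empty[(LocalDate, LocalDate)]) { (acc, r) =>
    // start_duration = max(previous end_duration + 1 day, current date)
    val start = acc.lastOption match {
      case Some((_, prevEnd)) =>
        val candidate = prevEnd.plusDays(1)
        if (candidate.isAfter(r.date)) candidate else r.date
      case None => r.date // first row: no previous end_duration
    }
    // end_duration = min(prescription_end_date, start_duration + duration - 1)
    val rawEnd = start.plusDays(r.duration - 1)
    val end = if (rawEnd.isAfter(prescriptionEnd)) prescriptionEnd else rawEnd
    acc :+ ((start, end))
  }
```

On the example rows, the third dispensation's start becomes 2015-08-14 (the previous end 2015-08-13 plus one day, which is later than its own date 2015-08-01), while the fourth row's gap leaves its start at its own date 2015-10-01, matching the intended behavior for gaps.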

Input DataFrame:

+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+
|0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|
|0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+

EXPECTED RESULT:
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|ndc      |label      |dispensation_uid|date      |duration|start_date|end_date  |first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|0               |0          |16714-128|sinvastatin|0               |2015-06-10|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-06-10    |2015-07-09  |
|0               |0          |16714-128|sinvastatin|1               |2015-07-15|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-07-15    |2015-08-13  |
|0               |0          |16714-128|sinvastatin|2               |2015-08-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-08-14    |2015-09-13  |
|0               |0          |16714-128|sinvastatin|3               |2015-10-01|30      |2015-06-01|2015-12-01|2015-06-10          |2015-07-09        |2015-10-01    |2015-10-30  |
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+

This is a follow-up question: the solution from a previous thread did not work when there is a gap between rows, as in the third and fourth rows of the example above:

https://stackoverflow.com/questions/64396803/how-to-apply-window-function-in-memory-transformation-with-new-column-scala/64405160#64405160

2 Answers


Break your problem into two parts:

1. Use lag to get the previous row's value and create a new column.

2. Use least (for end_duration) and greatest (for start_duration) to apply the formulas.

I can help in SQL.
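The two steps above might look like the following Spark SQL sketch (the table name `dispensations` is assumed; column names come from the question). Note that Spark's greatest/least skip nulls, so the first row of each prescription falls back to its own date. Also note a limitation: lag only sees the raw previous row, so this single pass does not chain the *adjusted* end_duration forward across overlapping rows:

```sql
SELECT t.*,
       -- start_duration = greatest(previous end + 1 day, current date)
       GREATEST(date_add(prev_end, 1), date) AS start_duration,
       -- end_duration = least(prescription end, start + duration - 1)
       LEAST(end_date,
             date_add(GREATEST(date_add(prev_end, 1), date),
                      duration - 1)) AS end_duration
FROM (
  SELECT *,
         LAG(date_add(date, duration - 1))
           OVER (PARTITION BY prescription_uid ORDER BY date) AS prev_end
  FROM dispensations
) t
```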

sandeep rawat
    prescription_uid,patient_uid,ndc,label,dispensation_uid,date,duration,start_date,end_date
    0,0 ,16714-128,sinvastatin,0,2015-06-10,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,1,2015-07-15,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,2,2015-08-01,30,2015-06-01,2015-12-01
    0,0 ,16714-128,sinvastatin,3,2015-10-01,30,2015-06-01,2015-12-01
    
    import java.text.SimpleDateFormat
    import java.util.Date
    import java.util.concurrent.TimeUnit
    import org.apache.spark.sql.functions.{lit, udf}
    import spark.implicits._

    var df = spark.read.option("header", "true")
      .csv("file:///home/xxxx/Ram/sample_stack.csv")

    // The first dispensation date drives first_start_duration for every row
    val dates = df.select("date").collect.map(_.getString(0))
    val dt = dates(0)
    df = df.withColumn("first_start_duration", lit(dt))

    // Adds (duration - 1) days to a yyyy-MM-dd date string
    val date_add = udf((x: String, y: Int) => {
      val sdf = new SimpleDateFormat("yyyy-MM-dd")
      val z = y - 1
      val result = new Date(sdf.parse(x).getTime() + TimeUnit.DAYS.toMillis(z))
      sdf.format(result)
    })

    df = df.withColumn("first_end_duration",
      date_add($"first_start_duration", $"duration".cast("int")))
    df = df.withColumn("start_duration", df("date"))
    df = df.withColumn("end_duration",
      date_add($"start_duration", $"duration".cast("int")))

Result : 

+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|prescription_uid|patient_uid|      ndc|      label|dispensation_uid|      date|duration|start_date|  end_date|first_start_duration|first_end_duration|start_duration|end_duration|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+
|               0|          0|16714-128|sinvastatin|               0|2015-06-10|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-06-10|  2015-07-09|
|               0|          0|16714-128|sinvastatin|               1|2015-07-15|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-07-15|  2015-08-13|
|               0|          0|16714-128|sinvastatin|               2|2015-08-01|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-08-01|  2015-08-30|
|               0|          0|16714-128|sinvastatin|               3|2015-10-01|      30|2015-06-01|2015-12-01|          2015-06-10|        2015-07-09|    2015-10-01|  2015-10-30|
+----------------+-----------+---------+-----------+----------------+----------+--------+----------+----------+--------------------+------------------+--------------+------------+