6

I have a DateFrame as follow:

+---+---------------------+---------------------+
|id |initDate             |endDate              |
+---+---------------------+---------------------+
|138|2016-04-15 00:00:00.0|2016-04-28 00:00:00.0|
|138|2016-05-09 00:00:00.0|2016-05-23 00:00:00.0|
|138|2016-06-04 00:00:00.0|2016-06-18 00:00:00.0|
|138|2016-06-18 00:00:00.0|2016-07-02 00:00:00.0|
|138|2016-07-09 00:00:00.0|2016-07-23 00:00:00.0|
|138|2016-07-27 00:00:00.0|2016-08-10 00:00:00.0|
|138|2016-08-18 00:00:00.0|2016-09-01 00:00:00.0|
|138|2016-09-13 00:00:00.0|2016-09-27 00:00:00.0|
|138|2016-10-04 00:00:00.0|null                 |
+---+---------------------+---------------------+

The rows are ordered by id then initDate column in ascending order. Both initDate and endDate columns have Timestamp type. For illustrative purpose, I just showed the records belonging to one id value.

My goal is to add a new column, showing for each id the difference (in term of days) between the initDate of each row and the endDate of the previous row.

If there is no previous row, then the value will be -1.

The output should look like this:

+---+---------------------+---------------------+----------+
|id |initDate             |endDate              |difference|
+---+---------------------+---------------------+----------+
|138|2016-04-15 00:00:00.0|2016-04-28 00:00:00.0|-1        |
|138|2016-05-09 00:00:00.0|2016-05-23 00:00:00.0|11        |
|138|2016-06-04 00:00:00.0|2016-06-18 00:00:00.0|12        |
|138|2016-06-18 00:00:00.0|2016-07-02 00:00:00.0|0         |
|138|2016-07-09 00:00:00.0|2016-07-23 00:00:00.0|7         |
|138|2016-07-27 00:00:00.0|2016-08-10 00:00:00.0|4         |
|138|2016-08-18 00:00:00.0|2016-09-01 00:00:00.0|8         |
|138|2016-09-13 00:00:00.0|2016-09-27 00:00:00.0|12        |
|138|2016-10-04 00:00:00.0|null                 |7         |
+---+---------------------+---------------------+----------+

I am thinking to use a window function to partition the records by id, but I am not figuring how to do the next steps.

Rami
  • 8,044
  • 18
  • 66
  • 108
  • Why do you want to find the difference between initdate of current row and end date of previous row on manually sorted records? – Shankar Oct 31 '16 at 13:03
  • Yes, what do you mean by "manually" sorted records? The records are sorted no matter if manually or not. – Rami Oct 31 '16 at 13:05
  • Well, I need to calculate this difference, I am asking if there is a way to calculate it. It is equivalent to calculating the difference between the endDate of the current row and the initDate of the next one. – Rami Oct 31 '16 at 13:07
  • Can you store the DF as temp table and try to write a query like in this thread http://stackoverflow.com/questions/9994862/date-difference-between-consecutive-rows – Shankar Oct 31 '16 at 13:11

3 Answers3

7

Thanks to the hint of @lostInOverflow, I came up with the following solution:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("id").orderBy("initDate")
val previousEnd = lag($"endDate", 1).over(w)
filteredDF.withColumn("prev", previousEnd)
          .withColumn("difference", datediff($"initDate", $"prev"))
Rami
  • 8,044
  • 18
  • 66
  • 108
5

Try:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("id").orderBy("endDate")

df.withColumn("difference", date_sub($"initDate", lag($"endDate", 1).over(w)))
Omley
  • 426
  • 6
  • 17
  • Doesn't date_sub take the second argument being number of days? I think datediff is the function needed. – joelc Aug 07 '19 at 08:47
0

Just an addition to previously good answers, in case anyone wants to try with spark sql or on Hive.

select tab.tran_id,tab.init_date,tab.end_date,coalesce(tab.day_diff,-1)
as day_diffrence from
(select *,datediff(day,lag(end_date,1) over(partition by tran_id order by init_date)
,init_date) as day_diff from your_table) tab
;
vikrant rana
  • 4,509
  • 6
  • 32
  • 72