
I am new to Apache Spark, and I have a use case: identifying the date gaps between multiple date ranges.

e.g. (the sample input was shown as an image)

In the above example, the member has a gap from 2018-02-01 to 2018-02-14. How can I find this in Apache Spark 2.3.4 using Scala?

Expected output for the above scenario is:

(expected output was shown as an image)

Gowri24

1 Answer


You could use `datediff` along with the Window function `lag` to check for day gaps between the current and previous rows, then compute the missing date ranges with a few date functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
import java.sql.Date

val df = Seq(
  (1, Date.valueOf("2018-01-01"), Date.valueOf("2018-01-31")),
  (1, Date.valueOf("2018-02-16"), Date.valueOf("2018-02-28")),
  (1, Date.valueOf("2018-03-01"), Date.valueOf("2018-03-31")),
  (2, Date.valueOf("2018-07-01"), Date.valueOf("2018-07-31")),
  (2, Date.valueOf("2018-08-16"), Date.valueOf("2018-08-31"))
).toDF("MemberId", "StartDate", "EndDate")

val win = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")

df.
  // previous EndDate within the member; seed with StartDate - 1 so the first row gets DayGap = 1
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(win), date_sub($"StartDate", 1))).
  // number of days between this row's StartDate and the previous row's EndDate
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  // a gap exists only when at least one full day is missing
  where($"DayGap" > 1).
  // the missing range runs from the day after PrevEndDate to the day before StartDate
  select($"MemberId", date_add($"PrevEndDate", 1).as("StartDateGap"), date_sub($"StartDate", 1).as("EndDateGap")).
  show
// +--------+------------+----------+
// |MemberId|StartDateGap|EndDateGap|
// +--------+------------+----------+
// |       1|  2018-02-01|2018-02-15|
// |       2|  2018-08-01|2018-08-15|
// +--------+------------+----------+
Leo C
  • This one is good, but it fails in certain cases where dates overlap, e.g. if you add `(2, Date.valueOf("2018-05-01"), Date.valueOf("2018-08-05"))` to the DF you will still get the same result as above. Still a good answer, assuming steps are taken to remove redundant data. – Charlie Flowers Jan 17 '20 at 22:17
  • Thanks, but it fails when there are overlapping dates, e.g. if you add `(1, Date.valueOf("2018-01-15"), Date.valueOf("2018-01-30"))` to the DF you will get a StartDateGap of "2018-01-31"; however, this member has an enrollment on "2018-01-31". Could you kindly tell me how to remove the overlapping days before computing this? – Gowri24 Jan 17 '20 at 22:50
  • Leo, I found one of your answers that removes overlapping dates (https://stackoverflow.com/questions/52877237/in-spark-scala-how-to-check-overlapping-dates-from-adjacent-rows-in-a-dataframe). Thanks a lot. I will try my best to merge these two, and will let you know if I need any more help. Thanks a lot. – Gowri24 Jan 17 '20 at 23:20
  • @Gowri24, yes, the above solution is based on your simplistic sample data without overlapping date ranges. Combining my other solution with this one will work. After applying the other solution, simply apply to the resulting DataFrame `resDF.withColumn("RowNum", row_number.over(win3)).where($"RowNum" === 1)`, with `win3` defined as `Window.partitionBy("MemberId", "StartDate", "EndDate").orderBy("StartDate", "EndDate")`, followed by the above transformations for gap finding. – Leo C Jan 18 '20 at 00:56
  • @LeoC, Thanks for your response. – Gowri24 Jan 18 '20 at 01:24
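The two-step combination discussed in the comments can be sketched as follows: first merge overlapping or adjacent ranges per member, then run the same gap transformation on the merged ranges. This is a hedged sketch, not the exact code from the linked answer; the `PrevMaxEnd`, `NewGroup`, and `GroupId` column names are illustrative, and the merge uses a running max of `EndDate` rather than the linked answer's approach:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import java.sql.Date

// In spark-shell, `spark` already exists; the builder is only here to make the sketch self-contained.
val spark = SparkSession.builder.master("local[1]").appName("date-gaps").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, Date.valueOf("2018-01-01"), Date.valueOf("2018-01-31")),
  (1, Date.valueOf("2018-01-15"), Date.valueOf("2018-01-30")), // overlaps the previous row
  (1, Date.valueOf("2018-02-16"), Date.valueOf("2018-02-28")),
  (1, Date.valueOf("2018-03-01"), Date.valueOf("2018-03-31"))  // adjacent to the previous row
).toDF("MemberId", "StartDate", "EndDate")

val win = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")

// Step 1: merge overlapping/adjacent ranges. A row starts a new group only when its
// StartDate is more than one day past the running max of all earlier EndDates.
val merged = df.
  withColumn("PrevMaxEnd", max($"EndDate").over(win.rowsBetween(Window.unboundedPreceding, -1))).
  withColumn("NewGroup", when($"PrevMaxEnd".isNull || datediff($"StartDate", $"PrevMaxEnd") > 1, 1).otherwise(0)).
  withColumn("GroupId", sum($"NewGroup").over(win)).
  groupBy("MemberId", "GroupId").
  agg(min($"StartDate").as("StartDate"), max($"EndDate").as("EndDate"))

// Step 2: apply the same gap transformation as the answer, now on the merged ranges.
val winM = Window.partitionBy("MemberId").orderBy("StartDate", "EndDate")
val gaps = merged.
  withColumn("PrevEndDate", coalesce(lag($"EndDate", 1).over(winM), date_sub($"StartDate", 1))).
  withColumn("DayGap", datediff($"StartDate", $"PrevEndDate")).
  where($"DayGap" > 1).
  select($"MemberId", date_add($"PrevEndDate", 1).as("StartDateGap"), date_sub($"StartDate", 1).as("EndDateGap"))

gaps.show
// +--------+------------+----------+
// |MemberId|StartDateGap|EndDateGap|
// +--------+------------+----------+
// |       1|  2018-02-01|2018-02-15|
// +--------+------------+----------+
```

With the overlapping row `2018-01-15..2018-01-30` merged into `2018-01-01..2018-01-31` and the adjacent March range merged into February's, only the genuine gap `2018-02-01..2018-02-15` survives.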