I have a dataframe like the one below and want to reduce it by merging adjacent rows, i.e. rows where previous.close = current.open.
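For reproducibility, this is the setup I'm running with (a minimal sketch; in spark-shell the `spark` session and its implicits are already provided, so this boilerplate is only needed in a standalone application):

import org.apache.spark.sql.SparkSession

// Local session just for testing this example
val spark = SparkSession.builder()
  .appName("merge-adjacent-ranges")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._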
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-28"),
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-09-20")
)
The required output is below:
(("Ray","2018-09-01","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-20"))
So far, I've been able to condense the adjacent rows with the DataFrame join below.
df.alias("t1").join(df.alias("t2"),$"t1.name" === $"t2.name" and $"t1.close"=== $"t2.open" )
.select("t1.name","t1.open","t2.close")
.distinct.show(false)
+-----+----------+----------+
|name |open      |close     |
+-----+----------+----------+
|Scott|2018-09-24|2018-09-30|
|Scott|2018-10-11|2018-10-20|
|Ray  |2018-09-01|2018-09-15|
|Ray  |2018-09-21|2018-09-30|
+-----+----------+----------+
I'm trying to use a similar style to pick up the remaining single rows by joining on $"t1.close" =!= $"t2.open" and then taking a union of the two results to get the final output. But that join produces unwanted rows which I'm not able to filter out correctly. How can I achieve this?
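For reference, this is roughly the shape of my attempt (a sketch; `merged` is the join shown above, `singles` is the =!= variant that misbehaves):

// Condensed pairs, as shown above: previous.close == current.open
val merged = df.alias("t1")
  .join(df.alias("t2"), $"t1.name" === $"t2.name" && $"t1.close" === $"t2.open")
  .select("t1.name", "t1.open", "t2.close")
  .distinct

// This is where the unwanted rows come from: a row pairs with *any* other
// same-name row whose open differs, so already-merged rows show up here too.
val singles = df.alias("t1")
  .join(df.alias("t2"), $"t1.name" === $"t2.name" && $"t1.close" =!= $"t2.open")
  .select("t1.name", "t1.open", "t1.close")
  .distinct

merged.union(singles).show(false)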
This question is not the same as Spark SQL window function with complex condition, where an additional date column is calculated as a new column.