Given a data-source with fields: product_id
- product
- start_time
- end_time
I am trying to build the logic for catching Overlapping Records for the same product
(based on start_time
and end_time
) using the Dataframe functions.
------------------------------------------------
| product_id | product | start_time | end_time |
------------------------------------------------
| 1 | bottle | 2 | 4 |
| 2 | bottle | 3 | 5 |
| 3 | bottle | 2 | 3 |
| 4 | bottle | 6 | 7 |
| 1 | can | 2 | 4 |
| 2 | can | 5 | 6 |
| 3 | can | 2 | 4 |
I'd like to receive in output
-------------------------------------------------------------------------------------------------
| product_id_a | product_id_b | product | start_time_a | end_time_a | start_time_b | end_time_b |
-------------------------------------------------------------------------------------------------
| 1 | 2 | bottle | 2 | 4 | 3 | 5 |
| 1 | 3 | bottle | 2 | 4 | 2 | 3 |
Because bottle_1
has overlapping times with bottle_2
and bottle_3
, where 2 records are overlapping if the following conditions are satisfied:
max(a.start_time, b.start_time) < min(a.end_time, b.end_time)
!(a.start_time == b.start_time && a.end_time == b.end_time)
a.start_time != b.start_time || a.end_time != b.end_time
where the last 2 conditions just specify that I am not interested in cases in which start_time
and end_time
are equal (e.g. can_1
and can_3
are not in the expected result even if they have same start_time
and end_time
).
For how the problem is structured is easy to think to a solution with MapReduce using RDD, but I am interested to a solution with Dataframes.
HINT: is there any chance that with groupBy().agg()
is possible to specify an interesting condition that reaches the described logic?
For any further explanation feel free to ask
NOT DUPLICATE of How to aggregate over rolling time window with groups in Spark
Unfortunately in the reported answer has been used F.lag
, which in my case is not a good enough condition: F.lag uses a comparison with the previous record only, but in the reported example would not work as desired since that bottle_1
would not be reported as overlapping with bottle_3
because they are not consecutive records