
Given a data source with the fields product_id, product, start_time and end_time:

I am trying to build the logic for catching overlapping records for the same product (based on start_time and end_time) using DataFrame functions.

------------------------------------------------
| product_id | product | start_time | end_time |
------------------------------------------------
|      1     | bottle  |     2      |    4     |
|      2     | bottle  |     3      |    5     |
|      3     | bottle  |     2      |    3     |
|      4     | bottle  |     6      |    7     |
|      1     |   can   |     2      |    4     |
|      2     |   can   |     5      |    6     |
|      3     |   can   |     2      |    4     |

I'd like to receive as output

-------------------------------------------------------------------------------------------------
| product_id_a | product_id_b | product | start_time_a | end_time_a | start_time_b | end_time_b |
-------------------------------------------------------------------------------------------------
|       1      |       2      | bottle  |      2       |     4      |      3       |     5      |
|       1      |       3      | bottle  |      2       |     4      |      2       |     3      |

This is because bottle_1 has overlapping times with bottle_2 and bottle_3. Two records overlap if the following conditions are all satisfied:

  • max(a.start_time, b.start_time) < min(a.end_time, b.end_time)
  • !(a.start_time == b.start_time && a.end_time == b.end_time)
  • a.start_time != b.start_time || a.end_time != b.end_time

The last two conditions just specify that I am not interested in pairs whose start_time and end_time are both equal (e.g. can_1 and can_3 are not in the expected result even though they have the same start_time and end_time); note that these two conditions are logically equivalent, so either one is enough.
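To make the conditions concrete, here is a quick check of them against the sample rows (a plain-Python sketch, just to illustrate the logic, not Spark code):

def overlaps(a, b):
    # a and b are (start_time, end_time) tuples
    a_start, a_end = a
    b_start, b_end = b
    same_interval = (a_start == b_start) and (a_end == b_end)
    return max(a_start, b_start) < min(a_end, b_end) and not same_interval

print(overlaps((2, 4), (3, 5)))  # bottle_1 vs bottle_2 -> True
print(overlaps((2, 4), (2, 3)))  # bottle_1 vs bottle_3 -> True
print(overlaps((2, 4), (6, 7)))  # bottle_1 vs bottle_4 -> False (disjoint)
print(overlaps((2, 4), (2, 4)))  # can_1 vs can_3 -> False (identical interval)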

Given how the problem is structured, it is easy to think of a solution with MapReduce on RDDs, but I am interested in a solution with DataFrames.

HINT: is there any chance that with groupBy().agg() it is possible to specify a condition that achieves the described logic?

Feel free to ask for any further explanation.

NOT A DUPLICATE of How to aggregate over rolling time window with groups in Spark

Unfortunately, the answer reported there uses F.lag, which in my case is not a sufficient condition: F.lag compares each record with the previous one only, so in the example above it would not work as desired, since bottle_1 would not be reported as overlapping with bottle_3 because they are not consecutive records.


3 Answers


Each of the conditions can be translated directly into SQL

from pyspark.sql.functions import col, least, greatest

cond1 = (
    greatest(col("a.start_time"), col("b.start_time")) < 
    least(col("a.end_time"), col("b.end_time"))
)

cond2 = ~(
    (col("a.start_time") == col("b.start_time")) & 
    (col("a.end_time") == col("b.end_time"))
)

cond3 = (
    (col("a.start_time") != col("b.start_time")) | 
    (col("a.end_time") != col("b.end_time"))
)

so you can just join and filter.

(df.alias("a").join(df.alias("b"), ["product"]).filter(cond1 & cond2 & cond3))
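If you also want the output columns named as in the question, something along these lines should work (the extra a.product_id < b.product_id filter is only there to drop mirrored (b, a) pairs and is not required by the conditions above):

result = (
    df.alias("a")
    .join(df.alias("b"), ["product"])
    .filter(cond1 & cond2 & cond3)
    .filter(col("a.product_id") < col("b.product_id"))  # keep each pair once
    .select(
        col("a.product_id").alias("product_id_a"),
        col("b.product_id").alias("product_id_b"),
        col("product"),
        col("a.start_time").alias("start_time_a"),
        col("a.end_time").alias("end_time_a"),
        col("b.start_time").alias("start_time_b"),
        col("b.end_time").alias("end_time_b"),
    )
)
result.show()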

Based on @Andronicus's solution, I came up with this approach in pure Python.

It was necessary to join the DataFrame with itself to check whether rows overlap. Of course, you need to exclude self-matches with the condition df.product_id < duplicate_df.product_id (a row always overlaps with itself, and without the strict inequality each pair would appear twice, once in each order).

The whole code:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, "bottle", 2, 4),
     (2, "bottle", 3, 5),
     (3, "bottle", 2, 3),
     (4, "bottle", 6, 7),
     (1, "can", 2, 4),
     (2, "can", 5, 6),
     (3, "can", 2, 4)], 
     ['product_id', 'product', 'start_time', 'end_time'])

duplicate_df = df

conditions = [df.product == duplicate_df.product,
              df.product_id < duplicate_df.product_id,
              # at least one of the two bounds must differ (identical intervals are excluded)
              (df.start_time != duplicate_df.start_time) |
              (df.end_time != duplicate_df.end_time),
              F.least(df.end_time, duplicate_df.end_time) >
              F.greatest(df.start_time, duplicate_df.start_time)]

df.join(duplicate_df, conditions)

Try this:

df.crossJoin(cloneDf).where($"product_id" < $"product_id1").where($"min" < $"max1").where($"min1" < $"max").show()

You need to make a Cartesian product of the DataFrame with itself to check whether rows overlap; you can then map the pairs however you want. Of course, you need to omit self-matches, since a row always overlaps with itself.

The whole code:

val df = SparkEmbedded.ss.createDataFrame(Seq(
  (1, 2, 5),
  (2, 4, 7),
  (3, 6, 9)
)).toDF("product_id", "min", "max")
import SparkEmbedded.ss.implicits._
import org.apache.spark.sql.functions.col
val cloneDf = df.select(df.columns.map(col):_*)
    .withColumnRenamed("product_id", "product_id1")
    .withColumnRenamed("min", "min1")
    .withColumnRenamed("max", "max1")
df.crossJoin(cloneDf)
  .where($"product_id" < $"product_id1")
  .where($"min" < $"max1")
  .where($"min1" < $"max").show()

I have split the where clause for clarity.

The result is:

+----------+---+---+-----------+----+----+
|product_id|min|max|product_id1|min1|max1|
+----------+---+---+-----------+----+----+
|         1|  2|  5|          2|   4|   7|
|         2|  4|  7|          3|   6|   9|
+----------+---+---+-----------+----+----+

The example is in Scala, but Python has a similar API.
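A rough PySpark equivalent of the same idea (assuming an existing SparkSession named spark) could look like this:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 2, 5), (2, 4, 7), (3, 6, 9)],
    ["product_id", "min", "max"])

clone_df = (df
    .withColumnRenamed("product_id", "product_id1")
    .withColumnRenamed("min", "min1")
    .withColumnRenamed("max", "max1"))

(df.crossJoin(clone_df)
    .where(F.col("product_id") < F.col("product_id1"))
    .where(F.col("min") < F.col("max1"))
    .where(F.col("min1") < F.col("max"))
    .show())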

  • thanks a lot! Both solutions (yours and @user11013893) helped me to solve my problem! Your solution has `where($"label" < $"label1")` which helped me to avoid duplicated results, while the other provided more interesting functions. Probably the `crossJoin` was not necessary: a normal inner join based on `product` (see the example) was enough. I don't understand why my question was down-voted (same for your answer which I up-voted). But I think that your answer was more useful with the use of the `< condition` in `where($"label" < $"label1")`. I'll accept yours – Vzzarr Feb 05 '19 at 10:19
  • to help readability of the answer, could you edit your post and make it closer to the example that I reported? So change `where($"label" < $"label1")` to `where($"product_id" < $"product_id1")`, the join based on `product`... and so on – Vzzarr Feb 05 '19 at 10:22
  • I suggest putting the Python code in a separate answer as it changes the programming style completely. – Andronicus Feb 05 '19 at 16:19
  • I started working with ML with python, but it was too dynamic for me. Scala is perfect (professionally I develop in Java) ;> – Andronicus Feb 05 '19 at 16:44