
I am looking for help joining two DataFrames with a conditional join on their time columns, using Spark Scala.

DF1:

+-------------------+--------+--------+
|time_1             |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1       |abc     |
|2022-04-05 10:15:00|2       |abc     |
|2022-04-05 12:15:00|3       |abc     |
|2022-04-05 09:00:00|1       |xyz     |
|2022-04-05 20:20:00|2       |xyz     |
+-------------------+--------+--------+

DF2:

+-------------------+--------+-------+
|time_2             |review_2|value  |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc     |value_1|
|2022-04-05 09:48:00|abc     |value_2|
|2022-04-05 15:40:00|abc     |value_3|
|2022-04-05 08:00:00|xyz     |value_4|
|2022-04-05 09:00:00|xyz     |value_5|
|2022-04-05 10:00:00|xyz     |value_6|
|2022-04-05 11:00:00|xyz     |value_7|
|2022-04-05 12:00:00|xyz     |value_8|
+-------------------+--------+-------+

Desired Output DF:

+-------------------+--------+--------+-------+
|time_1             |revision|review_1|value  |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1       |abc     |value_1|
|2022-04-05 10:15:00|2       |abc     |value_2|
|2022-04-05 12:15:00|3       |abc     |null   |
|2022-04-05 09:00:00|1       |xyz     |value_6|
|2022-04-05 20:20:00|2       |xyz     |null   |
+-------------------+--------+--------+-------+

As in row 4 of the desired output (where time_1 = 2022-04-05 09:00:00): if multiple DF2 rows match during the join, only the latest one in time should be taken.

Furthermore, if a DF1 row has no match in the join, its value column should be null.

The join is on two columns across the two DataFrames:

  • review_1 === review_2, and
  • time_1 within +/- 1 hour of time_2; if multiple DF2 rows match, take the latest one (value_6 in the example above).
  • I would recommend editing your question to format your DataFrames to make them more readable. Though it is a Python example, [this question](https://stackoverflow.com/a/30719662/2743131) is quite relevant and may have the full solution you need. – tjheslin1 Apr 05 '22 at 10:14

1 Answer


Here is the code necessary to join the DataFrames.

I have commented the code so as to explain the logic.

TL;DR

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val SECONDS_IN_ONE_HOUR = 60 * 60

// partition by both keys so rows from different reviews that happen to share
// the same `time_1` are not collapsed into a single partition
val window = Window.partitionBy("review_1", "time_1").orderBy(col("time_2").desc)

df1WithEpoch
    .join(df2WithEpoch, 
          df1WithEpoch("review_1") === df2WithEpoch("review_2") 
          && (
              // if `time_1` is in between `time_2` and `time_2` - 1 hour
              (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
              // if `time_1` is in between `time_2` and `time_2` + 1 hour
              && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
          ),
          // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
          "left_outer"
         )
    .withColumn("row_num", row_number().over(window))
    .filter(col("row_num") === 1)
    // select only the columns we care about
    .select("time_1", "revision", "review_1", "value")
    // order by to give the results in the same order as in the Question
    .orderBy(col("review_1"), col("revision"))
    .show(false)

Full breakdown

Let's start off with your DataFrames, df1 and df2, in code:

// `spark` is the active SparkSession (as in spark-shell); its implicits are
// needed for .toDF on a local List
import spark.implicits._

val df1 = List(
    ("2022-04-05 08:32:00", 1, "abc"),
    ("2022-04-05 10:15:00", 2, "abc"),
    ("2022-04-05 12:15:00", 3, "abc"),
    ("2022-04-05 09:00:00", 1, "xyz"),
    ("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")

df1.show(false)

gives:

+-------------------+--------+--------+
|time_1             |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1       |abc     |
|2022-04-05 10:15:00|2       |abc     |
|2022-04-05 12:15:00|3       |abc     |
|2022-04-05 09:00:00|1       |xyz     |
|2022-04-05 20:20:00|2       |xyz     |
+-------------------+--------+--------+

val df2 = List(
    ("2022-04-05 08:30:00", "abc", "value_1"),
    ("2022-04-05 09:48:00", "abc", "value_2"),
    ("2022-04-05 15:40:00", "abc", "value_3"),
    ("2022-04-05 08:00:00", "xyz", "value_4"),
    ("2022-04-05 09:00:00", "xyz", "value_5"),
    ("2022-04-05 10:00:00", "xyz", "value_6"),
    ("2022-04-05 11:00:00", "xyz", "value_7"),
    ("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")

df2.show(false)

gives:

+-------------------+--------+-------+
|time_2             |review_2|value  |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc     |value_1|
|2022-04-05 09:48:00|abc     |value_2|
|2022-04-05 15:40:00|abc     |value_3|
|2022-04-05 08:00:00|xyz     |value_4|
|2022-04-05 09:00:00|xyz     |value_5|
|2022-04-05 10:00:00|xyz     |value_6|
|2022-04-05 11:00:00|xyz     |value_7|
|2022-04-05 12:00:00|xyz     |value_8|
+-------------------+--------+-------+

Next we need new columns on which we can do the date-range check, with each time represented as a single number so that the arithmetic is easy:

// add a new column, temporarily, which contains the time in 
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))

df1WithEpoch.show()
df2WithEpoch.show()

gives:

+-------------------+--------+--------+------------+
|             time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00|       1|     abc|  1649147520|
|2022-04-05 10:15:00|       2|     abc|  1649153700|
|2022-04-05 12:15:00|       3|     abc|  1649160900|
|2022-04-05 09:00:00|       1|     xyz|  1649149200|
|2022-04-05 20:20:00|       2|     xyz|  1649190000|
+-------------------+--------+--------+------------+

+-------------------+--------+-------+------------+
|             time_2|review_2|  value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00|     abc|value_1|  1649147400|
|2022-04-05 09:48:00|     abc|value_2|  1649152080|
|2022-04-05 15:40:00|     abc|value_3|  1649173200|
|2022-04-05 08:00:00|     xyz|value_4|  1649145600|
|2022-04-05 09:00:00|     xyz|value_5|  1649149200|
|2022-04-05 10:00:00|     xyz|value_6|  1649152800|
|2022-04-05 11:00:00|     xyz|value_7|  1649156400|
|2022-04-05 12:00:00|     xyz|value_8|  1649160000|
+-------------------+--------+-------+------------+
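
As a quick sanity check (this snippet is mine, not part of the original answer; the epoch values are copied from the tables above), we can see why value_6, rather than value_5, ends up matched to the xyz row at 09:00:00: value_4, value_5 and value_6 all fall within one hour of it, and value_6 is the latest of the three, which the window in the join below then selects.

// Plain Scala sanity check using epoch values copied from the tables above.
// value_6 (10:00:00) is exactly one hour after time_1 (09:00:00), so it sits
// on the edge of the +/- 1 hour window and, being the latest match, is kept.
val SECONDS_IN_ONE_HOUR = 60 * 60
val epochTime1  = 1649149200L // 2022-04-05 09:00:00 (time_1, xyz, revision 1)
val epochValue6 = 1649152800L // 2022-04-05 10:00:00 (time_2 of value_6)
val inWindow = epochTime1 >= epochValue6 - SECONDS_IN_ONE_HOUR &&
               epochTime1 <= epochValue6 + SECONDS_IN_ONE_HOUR
println(inWindow) // true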

and finally to join:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val SECONDS_IN_ONE_HOUR = 60 * 60

// partition by both keys so rows from different reviews that happen to share
// the same `time_1` are not collapsed into a single partition
val window = Window.partitionBy("review_1", "time_1").orderBy(col("time_2").desc)

df1WithEpoch
    .join(df2WithEpoch, 
          df1WithEpoch("review_1") === df2WithEpoch("review_2") 
          && (
              // if `time_1` is in between `time_2` and `time_2` - 1 hour
              (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
              // if `time_1` is in between `time_2` and `time_2` + 1 hour
              && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
          ),
          // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
          "left_outer"
         )
    .withColumn("row_num", row_number().over(window))
    .filter(col("row_num") === 1)
    // select only the columns we care about
    .select("time_1", "revision", "review_1", "value")
    // order by to give the results in the same order as in the Question
    .orderBy(col("review_1"), col("revision"))
    .show(false)

gives:

+-------------------+--------+--------+-------+
|time_1             |revision|review_1|value  |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1       |abc     |value_1|
|2022-04-05 10:15:00|2       |abc     |value_2|
|2022-04-05 12:15:00|3       |abc     |null   |
|2022-04-05 09:00:00|1       |xyz     |value_6|
|2022-04-05 20:20:00|2       |xyz     |null   |
+-------------------+--------+--------+-------+
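
For completeness, here is a sketch of an equivalent way to write the join that skips the intermediate epoch columns and works on timestamp columns with interval arithmetic instead. The ts_1/ts_2 column names and the expr("INTERVAL 1 HOUR") literals are my own additions rather than part of the original answer; for this data the result should be the same.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// cast the string columns to proper timestamps
val df1Ts = df1.withColumn("ts_1", to_timestamp(col("time_1")))
val df2Ts = df2.withColumn("ts_2", to_timestamp(col("time_2")))

val w = Window.partitionBy("review_1", "time_1").orderBy(col("ts_2").desc)

df1Ts
    .join(df2Ts,
          df1Ts("review_1") === df2Ts("review_2")
          // `time_1` within +/- 1 hour of `time_2`, using interval arithmetic
          && df1Ts("ts_1") >= df2Ts("ts_2") - expr("INTERVAL 1 HOUR")
          && df1Ts("ts_1") <= df2Ts("ts_2") + expr("INTERVAL 1 HOUR"),
          "left_outer")
    .withColumn("row_num", row_number().over(w))
    .filter(col("row_num") === 1)
    .select("time_1", "revision", "review_1", "value")
    .orderBy(col("review_1"), col("revision"))
    .show(false)

Whether this reads better than the epoch-based version is largely a matter of taste; the interval literal keeps the one-hour window visible in the join condition itself.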