Here is the code needed to join the DataFrames; I have commented it to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// `time_1` is no more than one hour before `time_2`
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// `time_1` is no more than one hour after `time_2`
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to keep `2022-04-05 12:15:00` and `2022-04-05 20:20:00`, which have no match in `df2` within the time window
"left_outer"
)
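// keep only the most recent matching `df2` row (latest `time_2`) for each `df1` row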
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
Full breakdown
Let's start off with your DataFrames `df1` and `df2` in code:
import spark.implicits._ // for `.toDF` on local collections (already in scope in spark-shell)
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which we can do the date-range check (with time represented as a single number, the arithmetic becomes easy):
// add a new column, temporarily, which contains the time in epoch (Unix seconds)
// format: with this, adding or subtracting an hour is simple arithmetic.
import org.apache.spark.sql.functions.{col, unix_timestamp}
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
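As an aside, if you would rather not create the temporary epoch columns at all, roughly the same ±1 hour check can be written directly against the timestamps using Spark's INTERVAL arithmetic. This is only a sketch of the idea (it assumes a Spark version supporting timestamp ± interval arithmetic, and it is not used in the rest of the answer):
// sketch only: the ±1 hour check with INTERVAL arithmetic, casting the string
// time columns to timestamps first
import org.apache.spark.sql.functions.{expr, to_timestamp}
val intervalCondition =
  df1("review_1") === df2("review_2") &&
    (to_timestamp(df1("time_1")) >= to_timestamp(df2("time_2")) - expr("INTERVAL 1 HOUR")) &&
    (to_timestamp(df1("time_1")) <= to_timestamp(df2("time_2")) + expr("INTERVAL 1 HOUR"))
The answer below sticks with the epoch columns, which also keep the window ordering easy to reason about.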
and finally, the join itself:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// `time_1` is no more than one hour before `time_2`
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// `time_1` is no more than one hour after `time_2`
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to keep `2022-04-05 12:15:00` and `2022-04-05 20:20:00`, which have no match in `df2` within the time window
"left_outer"
)
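// keep only the most recent matching `df2` row (latest `time_2`) for each `df1` row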
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
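A note on the deduplication step: the window above keeps, for each `df1` row, the match with the latest `time_2` inside the ±1 hour range (that is what `orderBy(col("time_2").desc)` plus `row_num === 1` does). If you would rather keep the `df2` row closest in time to `time_1`, a small variation is to order the window by the absolute gap instead. This is only a sketch of that variation, not part of the result shown above; it also partitions by `review_1` as a guard against two reviews sharing the same `time_1`:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{abs, col}
// order each df1 row's candidate matches by how close time_2 is to time_1
val closestWindow = Window
  .partitionBy("time_1", "review_1")
  .orderBy(abs(col("epoch_time_1") - col("epoch_time_2")).asc)
// then use row_number().over(closestWindow) in place of `window` above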