
I would like to merge two data frames based on both a time period and lat/lon coordinates.

I originally performed an outer product to construct distances between the two data frames, using a window function. However, this created an enormous data explosion and shut down my cluster whenever I tried to run it (I can include this code if requested). In response, I decided to perform a sequence of inner joins in order to avoid this outer product. Put simply, I joined on the absolute value of the coordinate difference being equal to some specific value, until the remaining unmatched coordinates could be merged using the naive window approach. I have looked around but have not found any PySpark code that explicitly deals with this problem, so any help is appreciated.

from pyspark.sql import functions as F

# starting with exact matches
conditions = [df1.lon1 == df2.lon2,
              df1.lat1 == df2.lat2,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')

....
# then lon off by 1
conditions = [F.abs(df1.lon1 - df2.lon2) == 1,
              df1.lat1 == df2.lat2,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
...
# then lat off by 1
conditions = [df1.lon1 == df2.lon2,
              F.abs(df1.lat1 - df2.lat2) == 1,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
...
# then both off by 1, etc.
conditions = [F.abs(df1.lon1 - df2.lon2) == 1,
              F.abs(df1.lat1 - df2.lat2) == 1,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
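
The snippets above leave out how already-matched rows are removed from df1 between passes; a minimal sketch of that bookkeeping, assuming df1 carries a unique key column (called id1 here purely for illustration), might be:

# hypothetical bookkeeping between passes: drop the df1 rows that already matched,
# so the next, looser join only considers the remaining coordinates
matched_keys = current_df_hit.select("id1").distinct()
df1 = df1.join(matched_keys, on="id1", how="left_anti")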

However, these absolute-difference joins do not produce the expected results. For example, run the following:

df0 = spark.createDataFrame([
    ('id1', 10.1, 12.1),
    ('id5', 13.1, 13.1)], ["id0", "lat0", "lon0"])
df1 = spark.createDataFrame([
    ('id1', 10.1, 12.3),
    ('id5', 13.1, 13.2)], ["id1", "lat1", "lon1"])
# this produces nothing 
df0.join(df1, F.abs(df1.lon1 - df0.lon0) == 0.1, 'inner').show()
# and so does this
df0.join(df1, F.round(df1.lon1 - df0.lon0, 1) == 0.1).show()

Neither join returns any rows. Please advise, and thanks in advance!

1 Answer


So the reasoning behind this question was actually pretty weak. It turns out that you can apply basic operators in a withColumn operation after a join. However, in the context in which I posed the question, I found a better solution in the following code, which does not generate as large an intermediate data frame for my application:

df0 = spark.createDataFrame([
    ('id1', "2015-04-27 00:00:00", "2015-04-28 00:00:00", 10.1, 12.3, 10, 12),
    ('id1', "2015-04-29 00:00:00", "2015-04-30 00:00:00", 10.1, 12.1, 10, 12),
    ('id5', "2015-04-28 00:00:00", "2015-04-29 00:00:00", 13.1, 13.4, 13, 13),
    ('id5', "2015-04-28 00:00:00", "2015-04-29 00:00:00", 13.1, 13.1, 13, 13)], ["id0", "start", "end", "lat0", "lon0", "rlat0", "rlon0",])
df1 = spark.createDataFrame([
    ('id2', "2015-04-29 00:00:00", 10.1, 12.3, 10, 12),
    ('id2', "2015-04-28 00:00:00", 10.1, 12.5, 10, 12),
    ('id3', "2015-04-28 00:00:00", 13.1, 13.2, 13, 13)], ["id1", "date", "lat1", "lon1", "rlat1", "rlon1"])

from pyspark.sql import functions as F
from pyspark.sql import Window

# first join on the rounded coordinates and the time window
joindf = df1.join(df0, [df0.rlat0 == df1.rlat1,
                        df0.rlon0 == df1.rlon1,
                        df0.start <= df1.date,
                        df0.end >= df1.date])
joindf.show()
# compute the distance between the exact coordinates
joindf = joindf.withColumn("dist", F.abs(joindf.lat1 - joindf.lat0) + F.abs(joindf.lon1 - joindf.lon0))
joindf.show()
# rank the joined rows by distance within each rounded cell
joindf = joindf.withColumn("rank", F.row_number().over(
    Window.partitionBy("rlat0", "rlon0", "rlat1", "rlon1").orderBy("dist")))
joindf.show()
# keep only the nearest match
joindf.filter(F.col("rank") == 1).show()

Here I generate the intermediate join through rounding instead of taking subsets over the data frames of interest.
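
In the toy example the rounded columns (rlat0, rlon0, rlat1, rlon1) are hard-coded into the data. In practice they would be derived from the exact coordinates before the join; a minimal sketch, assuming 1-degree buckets via F.floor (any other coarse rounding would work the same way):

# hypothetical derivation of the rounded join keys from the exact coordinates;
# F.floor gives 1-degree buckets, matching the rlat/rlon values in the toy data
df0 = df0.withColumn("rlat0", F.floor("lat0")).withColumn("rlon0", F.floor("lon0"))
df1 = df1.withColumn("rlat1", F.floor("lat1")).withColumn("rlon1", F.floor("lon1"))

One caveat of this approach is that near matches falling just across a bucket boundary will not be joined; covering those would require also joining against neighbouring buckets, similar to the off-by-one joins attempted in the question.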