
I would like to merge two data frames based on both a time period and lat/lon coordinates.

I originally performed an outer product to construct distances between the two data frames, using a window function. However, this created an enormous data explosion and shut down my cluster whenever I tried to run it (I can include this code if requested). In response, I decided to perform a sequence of inner joins in order to avoid this outer product. Put simply, I joined on the absolute value of the coordinate difference being equal to some specific value, until the remaining unmatched coordinates could be merged using the naive window approach. I have looked around but have not found any PySpark code that explicitly deals with this problem, so any help is appreciated.

from pyspark.sql import functions as F

# starting with exact matches
conditions = [df1.lon1 == df2.lon2,
              df1.lat1 == df2.lat2,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')

....
# then lon off by 1
conditions = [F.abs(df1.lon1 - df2.lon2) == 1,
              df1.lat1 == df2.lat2,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
...
# then lat off by 1
conditions = [df1.lon1 == df2.lon2,
              F.abs(df1.lat1 - df2.lat2) == 1,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
...
# then both off by 1, etc.
conditions = [F.abs(df1.lon1 - df2.lon2) == 1,
              F.abs(df1.lat1 - df2.lat2) == 1,
              df2.start <= df1.time,
              df2.end >= df1.time]
current_df_hit = df1.join(df2, conditions, 'inner')
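
The snippets above leave out how already-matched rows are removed from df1 between passes; a minimal sketch of that bookkeeping, assuming df1 carries a unique key column (called id1 here purely for illustration), might be:

# hypothetical bookkeeping between passes: drop the df1 rows that already matched,
# so the next, looser join only considers the remaining coordinates
matched_keys = current_df_hit.select("id1").distinct()
df1 = df1.join(matched_keys, on="id1", how="left_anti")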

However, these absolute-difference joins do not produce the expected results. For example, run the following:

df0 = spark.createDataFrame([
    ('id1', 10.1, 12.1),
    ('id5', 13.1, 13.1)], ["id0", "lat0", "lon0"])
df1 = spark.createDataFrame([
    ('id1', 10.1, 12.3),
    ('id5', 13.1, 13.2)], ["id1", "lat1", "lon1"])
# this produces nothing 
df0.join(df1, F.abs(df1.lon1 - df0.lon0) == 0.1, 'inner').show()
# and so does this
df0.join(df1, F.round(df1.lon1 - df0.lon0, 1) == 0.1).show()

Neither join returns any rows. Please advise, and thanks in advance!

1 Answer


So the reasoning behind this question was actually pretty weak. It turns out that you can apply basic operators in a withColumn operation after a join. However, in the context in which I posed the question, I found a better solution in the following code, which does not generate as large an intermediate data frame for my application:

df0 = spark.createDataFrame([
    ('id1', "2015-04-27 00:00:00", "2015-04-28 00:00:00", 10.1, 12.3, 10, 12),
    ('id1', "2015-04-29 00:00:00", "2015-04-30 00:00:00", 10.1, 12.1, 10, 12),
    ('id5', "2015-04-28 00:00:00", "2015-04-29 00:00:00", 13.1, 13.4, 13, 13),
    ('id5', "2015-04-28 00:00:00", "2015-04-29 00:00:00", 13.1, 13.1, 13, 13)], ["id0", "start", "end", "lat0", "lon0", "rlat0", "rlon0",])
df1 = spark.createDataFrame([
    ('id2', "2015-04-29 00:00:00", 10.1, 12.3, 10, 12),
    ('id2', "2015-04-28 00:00:00", 10.1, 12.5, 10, 12),
    ('id3', "2015-04-28 00:00:00", 13.1, 13.2, 13, 13)], ["id1", "date", "lat1", "lon1", "rlat1", "rlon1"])

from pyspark.sql import functions as F
from pyspark.sql import Window

# first join on the rounded coordinates and the time window
joindf = df1.join(df0, [df0.rlat0 == df1.rlat1,
                        df0.rlon0 == df1.rlon1,
                        df0.start <= df1.date,
                        df0.end >= df1.date])
joindf.show()
# compute the distance between the exact coordinates
joindf = joindf.withColumn("dist", F.abs(joindf.lat1 - joindf.lat0) + F.abs(joindf.lon1 - joindf.lon0))
joindf.show()
# rank the joined rows by distance within each rounded cell
joindf = joindf.withColumn("rank", F.row_number().over(
    Window.partitionBy("rlat0", "rlon0", "rlat1", "rlon1").orderBy("dist")))
joindf.show()
# keep only the nearest match
joindf.filter(F.col("rank") == 1).show()

Here I generate the intermediate join through rounding instead of taking subsets over the data frames of interest.
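
In the toy example the rounded columns (rlat0, rlon0, rlat1, rlon1) are hard-coded into the data. In practice they would be derived from the exact coordinates before the join; a minimal sketch, assuming 1-degree buckets via F.floor (any other coarse rounding would work the same way):

# hypothetical derivation of the rounded join keys from the exact coordinates;
# F.floor gives 1-degree buckets, matching the rlat/rlon values in the toy data
df0 = df0.withColumn("rlat0", F.floor("lat0")).withColumn("rlon0", F.floor("lon0"))
df1 = df1.withColumn("rlat1", F.floor("lat1")).withColumn("rlon1", F.floor("lon1"))

One caveat of this approach is that near matches falling just across a bucket boundary will not be joined; covering those would require also joining against neighbouring buckets, similar to the off-by-one joins attempted in the question.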