Performing batch join in pyspark from timestamp and id using watermarking

Question

I have two dataframes with on equipment and equipment surrounding information. A row from the on equipment dataframe has the following schema:

on_equipment_df.head()

# Output of notebook:
Row(
on_eq_equipid='1', 
on_eq_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 866098), 
on_eq_charge=3.917107463423725, on_eq_electric_consumption=102.02754516792204, on_eq_force=10.551710736897613, on_eq_humidity=22.663245200558457, on_eq_pressure=10.813417893943944, on_eq_temperature=29.80448721128125, on_eq_vibration=4.376662536641158, 
measurement_status='Bad')

And a row from the equipment surrounding looks like:

equipment_surrounding_df.head()

# Output of notebook
Row(
eq_surrounding_equipid='1', 
eq_surrounding_timestamp=datetime.datetime(2020, 10, 7, 15, 27, 42, 903198),
eq_surrounding_dust=24.0885168316774, eq_surrounding_humidity=16.949569353381793, eq_surrounding_noise=12.256649392702574, eq_surrounding_temperature=8.141877435145844,
measurement_status='Good')

Notice that both tables have id's identifying the equipment and a timestamp relating to when the measurement was taken.

Problem: I want to perform a join between these two dataframes based on equipment id and the timestamps. The issue is that the timestamp is recorded with a very precise reading rendering the join through timestamp impossible (unless I round the timestamp, which I would like to leave as a last resort). The on equipment and equipment surrounding readings are recorded at different frequencies. I want therefore to perform a join only based on the equipment id but for windows between certain timestamp values. This is similar to what is done in structured streaming using watermarking.

To do this I tried to use the equivelent operation used in structured streaming as mentioned above called watermarking. Here is the code:

# Add watermark defined by the timestamp column of each df
watermarked_equipment_surr = equipment_surrounding_df.withWatermark("eq_surr_timestamp", "0.2 seconds")
watermarked_on_equipment = on_equipment_df.withWatermark("on_eq_timestamp", "0.2 seconds")

# Define new equipment dataframe based on the watermarking conditions described
equipment_df = watermarked_on_equipment.join(
    watermarked_equipment_surr, 
    expr("""
        on_eq_equipid = eq_surr_equipid AND
        on_eq_timestamp >= eq_surrounding_timestamp AND
        on_eq_timestamp <= eq_surrounding_timestamp + interval 0.2 seconds
        """))

# Perfrom a count (error appears here)
print("Equipment size:", equipment_df.count())

I get an error when performing this action. Based on this, I have two questions:

Is this the right way to solve such a use case / problem?
If so, why do I get an error in my code?

Thank you in advance

UPDATE:

So I beleive I found half the solution inspiered by:

Joining two spark dataframes on time (TimestampType) in python

Essentially the solution goes through creating two columns in one of the dataframes which represents an upper and lower limit of the timestamp using udfs. The code for that:


def lower_range_func(x, offset_milli=250):
    """
    From a timestamp and offset, get the timestamp obtained from subtractng the offset.
    """
    return x - timedelta(seconds=offset_milli/1000)

def upper_range_func(x,  offset_milli=250):
    """
    From a timestamp and offset, get the timestamp obtained from adding the offset.
    """
    return x + timedelta(seconds=offset_milli/1000)

# Create two dataframes
lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())

# Add these columns to the on_equipment dataframe
on_equipment_df = on_equipment_df\
    .withColumn('lower_on_eq_timestamp', lower_range(on_equipment_df["on_eq_timestamp"]))\
    .withColumn('upper_on_eq_timestamp', upper_range(on_equipment_df["on_eq_timestamp"]))

Once we have those columns, we can perform a filtered join using these new columns.

# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(equipment_surrounding_df)\
    .filter(on_equipment_df.on_eq_timestamp > equipment_surrounding_df.lower_eq_surr_timestamp)\
    .filter(on_equipment_df.on_eq_timestamp < equipment_surrounding_df.upper_eq_surr_timestamp)

The problem is, as soon as I try to join using also the equipment id, like so:

# Join dataframes based on a filtered join
equipment_df = on_equipment_df.join(
equipment_surrounding_df, on_equipment_df.on_eq_equipid == equipment_surrounding_df.eq_surr_equipid)\
    .filter(on_equipment_df.on_eq_timestamp > equipment_surrounding_df.lower_eq_surr_timestamp)\
    .filter(on_equipment_df.on_eq_timestamp < equipment_surrounding_df.upper_eq_surr_timestamp)

I get the an error. Any thoughts on this approach?

Performing batch join in pyspark from timestamp and id using watermarking

0 Answers0