My Spark 2.4.x (pyspark) app requires:
- Inputs are two Kafka topics and output is a Kafka topic
- A "streaming table" where
- there's a logical key (or keys), and
- the remaining columns hold the latest values from either stream.
- Sub-second latency. Tests show this is achievable when watermarks are not used.
This seems like a basic thing, but it's not working completely for me.
Example:
NOTE: In the example below, the T1, T2 & T3 points-in-time could be seconds/minutes/hours apart.
T1) At Time T1
KafkaPriceTopic gets 1 message with payload (let's call it P1):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"1", "PriceCol": "101.5"}
KafkaVolumeTopic gets 1 message with payload (let's call it V1):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"1", "VolumeCol": "50"}
I'd like to have a Result DataFrame that looks like:
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.5 |50 |1 |1 |
+-----------+--------+---------+-------------+--------------+
T2) KafkaPriceTopic gets 1 message (P2):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"2", "PriceCol": "101.6"}
Result DataFrame
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.6 |50 |2 |1 |
+-----------+--------+---------+-------------+--------------+
NOTE: P1 not relevant anymore
T3) KafkaVolumeTopic gets 1 message (V2):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"2", "VolumeCol": "60"}
Result DataFrame
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.6 |60 |2 |2 |
+-----------+--------+---------+-------------+--------------+
NOTE: P1 & V1 not relevant anymore
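To make the intended semantics concrete, here is the same T1-T3 sequence as a plain-Python fold (illustration only, not Spark code): each message upserts its columns into the row for its key.

P1 = {"SecurityCol": "Sec1", "PriceSeqNoCol": "1", "PriceCol": "101.5"}
V1 = {"SecurityCol": "Sec1", "VolumeSeqNoCol": "1", "VolumeCol": "50"}
P2 = {"SecurityCol": "Sec1", "PriceSeqNoCol": "2", "PriceCol": "101.6"}
V2 = {"SecurityCol": "Sec1", "VolumeSeqNoCol": "2", "VolumeCol": "60"}

state = {}
for msg in [P1, V1, P2, V2]:  # arrival order
    state.setdefault(msg["SecurityCol"], {}).update(msg)  # newest fields win

# state["Sec1"] now matches the T3 table:
# {"SecurityCol": "Sec1", "PriceSeqNoCol": "2", "PriceCol": "101.6",
#  "VolumeSeqNoCol": "2", "VolumeCol": "60"}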
What Works
- Extract the json from the payload (get_json_object for now), then join the two topics' streams.
- However, this would yield (w/o watermark) a DataFrame that has all the Price & Volume rows received for Sec1, not just the latest of either.
- So this is followed by a groupBy(...).agg(last(...), ...). But I am stuck on just getting one row with the latest values.
dfKafka1 = (spark.readStream.format("kafka")  # remaining options etc.
    .load()
    .select(...))  # pulls out fields as columns

dfKafka2 = (spark.readStream.format("kafka")  # remaining options etc.
    .load()
    .select(...))  # pulls out fields as columns

dfResult = dfKafka1.join(dfKafka2, "SecurityCol")

# structured streaming doesn't yet allow groupBy after a join,
# so write to an intermediate kafka topic
(dfResult.writeStream.format("kafka")  # remaining options
    .trigger(processingTime="1 second")
    .start())

# load the intermediate kafka topic
dfKafkaResult = (spark.readStream.format("kafka")  # remaining options
    .load()
    .select(...)  # get_json_object for cols
    .groupBy("SecurityCol")  # define the "key" to agg cols
    .agg(last("PriceCol"),  # most recent value per col
         last("PriceSeqNoCol"),
         last("VolumeCol"),
         last("VolumeSeqNoCol")))
Problem
However, the final agg with last() doesn't do the trick consistently:
- When KafkaVolumeTopic gets a new message, the result might join it with an older message from KafkaPriceTopic.
- Further, orderBy/sort can't be used on a stream without an aggregation, so I can't force an ordering before last() (one idea I've been toying with is sketched below).
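One direction I've been sketching (untested): since each topic carries a monotonically increasing SeqNo, aggregate with max over a struct that leads with the SeqNo, so "latest" is defined by the data instead of by arrival order. Here dfParsed is a hypothetical name for the intermediate-topic stream after the select but before the groupBy:

from pyspark.sql.functions import max as max_, struct

# struct comparison is left-to-right, so the SeqNo (cast to a numeric type,
# else "10" < "2" as strings) decides which row is "latest"
dfLatest = (dfParsed
    .groupBy("SecurityCol")
    .agg(max_(struct("PriceSeqNoCol", "PriceCol")).alias("p"),
         max_(struct("VolumeSeqNoCol", "VolumeCol")).alias("v"))
    .select("SecurityCol", "p.PriceCol", "v.VolumeCol",
            "p.PriceSeqNoCol", "v.VolumeSeqNoCol"))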
Restrictions
- I can't groupBy before the join, since that would require withWatermark, and I think my app can't use a watermark. Rationale (a sketch of what I'm ruling out follows this list):
  - The app should be able to join the two topics for a given SecurityCol at any time during the day.
  - If PriceTopic gets a message at 9am and VolumeTopic at 10am, I'd expect the two to be joined and present.
  - A watermark restricts when data is emitted in append mode, so I can't use a watermark here since the timeframe is the whole day.
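For completeness, this is the kind of pre-join aggregation I believe is ruled out (sketch only; "timestamp" is the Kafka source's built-in column, and the 10-minute bound is just an example):

from pyspark.sql.functions import last, window

dfLatestPrice = (dfKafka1
    .withWatermark("timestamp", "10 minutes")  # any finite bound
    .groupBy(window("timestamp", "10 minutes"), "SecurityCol")
    .agg(last("PriceCol").alias("PriceCol")))
# In append mode a window's row is only emitted after the watermark passes
# its end, so a 9am price couldn't stay live to meet a 10am volume.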
Any ideas?