My Spark 2.4.x (pyspark) app requires:
- Inputs are two Kafka topics and output is a Kafka topic
- A "streaming table" where
- there's a logical key (or keys), and
- the remaining columns hold the latest values from either stream.
- Sub-second latency. Tests show this is achievable when watermarks are not used.
This seems like a basic thing, but it's not working completely for me.
Example:
NOTE: In the example below, the T1, T2 & T3 points-in-time could be seconds/minutes/hours apart.
T1) At Time T1
KafkaPriceTopic gets 1 message with payload (let's call it P1):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"1", "PriceCol": "101.5"}
KafkaVolumeTopic gets 1 message with payload (let's call it V1):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"1", "VolumeCol": "50"}
I'd like to have a Result DataFrame that looks like:
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.5 |50 |1 |1 |
+-----------+--------+---------+-------------+--------------+
T2) KafkaPriceTopic gets 1 message (P2):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"2", "PriceCol": "101.6"}
Result DataFrame
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.6 |50 |2 |1 |
+-----------+--------+---------+-------------+--------------+
NOTE: P1 not relevant anymore
T3) KafkaVolumeTopic gets 1 message (V2):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"2", "VolumeCol": "60"}
Result DataFrame
+-----------+--------+---------+-------------+--------------+
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1 |101.6 |60 |2 |2 |
+-----------+--------+---------+-------------+--------------+
NOTE: P1 & V1 not relevant anymore
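To make the intended semantics concrete, here is the same T1-T3 sequence as a plain-Python fold (illustration only, not Spark code): each message upserts its columns into the row for its key.

P1 = {"SecurityCol": "Sec1", "PriceSeqNoCol": "1", "PriceCol": "101.5"}
V1 = {"SecurityCol": "Sec1", "VolumeSeqNoCol": "1", "VolumeCol": "50"}
P2 = {"SecurityCol": "Sec1", "PriceSeqNoCol": "2", "PriceCol": "101.6"}
V2 = {"SecurityCol": "Sec1", "VolumeSeqNoCol": "2", "VolumeCol": "60"}

state = {}
for msg in [P1, V1, P2, V2]:  # arrival order
    state.setdefault(msg["SecurityCol"], {}).update(msg)  # newest fields win

# state["Sec1"] now matches the T3 table:
# {"SecurityCol": "Sec1", "PriceSeqNoCol": "2", "PriceCol": "101.6",
#  "VolumeSeqNoCol": "2", "VolumeCol": "60"}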
What Works
- Extract the json from the payload (get_json_object for now), then join the two topics' streams.
- However, this would yield (w/o watermark) a DataFrame that has all the Price & Volume rows received for Sec1, not just the latest of either.
- So this is followed by a groupBy(...).agg(last(...), ...). But I am stuck on just getting one row with the latest values.
dfKafka1 = (spark.readStream.format("kafka")  # remaining options etc.
    .load()
    .select(...))  # pulls out fields as columns

dfKafka2 = (spark.readStream.format("kafka")  # remaining options etc.
    .load()
    .select(...))  # pulls out fields as columns

dfResult = dfKafka1.join(dfKafka2, "SecurityCol")

# structured streaming doesn't yet allow groupBy after a join,
# so write to an intermediate kafka topic
(dfResult.writeStream.format("kafka")  # remaining options
    .trigger(processingTime="1 second")
    .start())

# load the intermediate kafka topic
dfKafkaResult = (spark.readStream.format("kafka")  # remaining options
    .load()
    .select(...)  # get_json_object for cols
    .groupBy("SecurityCol")  # define the "key" to agg cols
    .agg(last("PriceCol"),  # most recent value per col
         last("PriceSeqNoCol"),
         last("VolumeCol"),
         last("VolumeSeqNoCol")))
Problem
However, the final agg with last() doesn't do the trick consistently:
- When KafkaVolumeTopic gets a new message, the result might join it with an older message from KafkaPriceTopic.
- Further, orderBy/sort can't be used on a stream without an aggregation, so I can't force an ordering before last() (one idea I've been toying with is sketched below).
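One direction I've been sketching (untested): since each topic carries a monotonically increasing SeqNo, aggregate with max over a struct that leads with the SeqNo, so "latest" is defined by the data instead of by arrival order. Here dfParsed is a hypothetical name for the intermediate-topic stream after the select but before the groupBy:

from pyspark.sql.functions import max as max_, struct

# struct comparison is left-to-right, so the SeqNo (cast to a numeric type,
# else "10" < "2" as strings) decides which row is "latest"
dfLatest = (dfParsed
    .groupBy("SecurityCol")
    .agg(max_(struct("PriceSeqNoCol", "PriceCol")).alias("p"),
         max_(struct("VolumeSeqNoCol", "VolumeCol")).alias("v"))
    .select("SecurityCol", "p.PriceCol", "v.VolumeCol",
            "p.PriceSeqNoCol", "v.VolumeSeqNoCol"))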
Restrictions
- I can't groupBy before the join, since that would require withWatermark, and I think my app can't use a watermark. Rationale (a sketch of what I'm ruling out follows this list):
  - The app should be able to join the two topics for a given SecurityCol at any time during the day.
  - If PriceTopic gets a message at 9am and VolumeTopic at 10am, I'd expect the two to be joined and present.
  - A watermark restricts when data is emitted in append mode, so I can't use a watermark here since the timeframe is the whole day.
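For completeness, this is the kind of pre-join aggregation I believe is ruled out (sketch only; "timestamp" is the Kafka source's built-in column, and the 10-minute bound is just an example):

from pyspark.sql.functions import last, window

dfLatestPrice = (dfKafka1
    .withWatermark("timestamp", "10 minutes")  # any finite bound
    .groupBy(window("timestamp", "10 minutes"), "SecurityCol")
    .agg(last("PriceCol").alias("PriceCol")))
# In append mode a window's row is only emitted after the watermark passes
# its end, so a 9am price couldn't stay live to meet a 10am volume.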
Any ideas?