
My Spark 2.4.x (pyspark) app requires:

  1. Inputs are two Kafka topics and output is a Kafka topic
  2. A "streaming table" where
    • there's a logical key(s) and
    • remaining columns should be latest values from either stream(s).
  3. Sub-second latency. Tests show this is achievable when watermarks are not used.

This seems like a basic thing, but it's not working completely for me.


Example:

NOTE: In the example below, the points in time T1, T2 & T3 could be seconds/minutes/hours apart.

T1) At Time T1

KafkaPriceTopic gets 1 message with payload (let's call it P1):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"1", "PriceCol": "101.5"}

KafkaVolumeTopic gets 1 message with payload (let's call it V1):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"1", "VolumeCol": "50"}

I'd like to have a Result DataFrame that looks like:

+-----------+--------+---------+-------------+--------------+ 
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|  
+-----------+--------+---------+-------------+--------------+ 
|Sec1       |101.5   |50       |1            |1             |
+-----------+--------+---------+-------------+--------------+ 

T2) KafkaPriceTopic gets 1 message (P2):
{ "SecurityCol":"Sec1", "PriceSeqNoCol":"2", "PriceCol": "101.6"}

Result DataFrame

+-----------+--------+---------+-------------+--------------+ 
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|  
+-----------+--------+---------+-------------+--------------+ 
|Sec1       |101.6   |50       |2            |1             |
+-----------+--------+---------+-------------+--------------+ 

NOTE: P1 not relevant anymore

T3) KafkaVolumeTopic gets 1 message (V2):
{ "SecurityCol":"Sec1", "VolumeSeqNoCol":"2", "VolumeCol": "60"}

Result DataFrame

+-----------+--------+---------+-------------+--------------+ 
|SecurityCol|PriceCol|VolumeCol|PriceSeqNoCol|VolumeSeqNoCol|
+-----------+--------+---------+-------------+--------------+
|Sec1       |101.6   |60       |2            |2             |
+-----------+--------+---------+-------------+--------------+ 

NOTE: P1 & V1 not relevant anymore
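
To pin down the semantics I'm after, here's a plain batch (non-streaming) sketch that reproduces the T3 result from the sample payloads. The max(struct(...)) trick is just one way to express "row with the highest sequence number per key" on static data; it's only meant to illustrate the target behaviour, not the streaming solution:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # static data mirroring P1, P2, V1, V2 from the example
    prices = spark.createDataFrame(
        [("Sec1", 1, 101.5), ("Sec1", 2, 101.6)],
        ["SecurityCol", "PriceSeqNoCol", "PriceCol"])
    volumes = spark.createDataFrame(
        [("Sec1", 1, 50), ("Sec1", 2, 60)],
        ["SecurityCol", "VolumeSeqNoCol", "VolumeCol"])

    # keep only the row with the highest sequence number per security,
    # then join the two "latest" views on the key
    latestPrice = (prices.groupBy("SecurityCol")
                   .agg(F.max(F.struct("PriceSeqNoCol", "PriceCol")).alias("p"))
                   .select("SecurityCol", "p.PriceCol", "p.PriceSeqNoCol"))
    latestVolume = (volumes.groupBy("SecurityCol")
                    .agg(F.max(F.struct("VolumeSeqNoCol", "VolumeCol")).alias("v"))
                    .select("SecurityCol", "v.VolumeCol", "v.VolumeSeqNoCol"))

    (latestPrice.join(latestVolume, "SecurityCol")
                .select("SecurityCol", "PriceCol", "VolumeCol",
                        "PriceSeqNoCol", "VolumeSeqNoCol")
                .show())
    # single row: Sec1 | 101.6 | 60 | 2 | 2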


What Works

  1. Extract the JSON from the payload (get_json_object for now) and join the two topics' streams.
  2. However, without a watermark this yields a DataFrame that has all the Price & Volume rows received for Sec1, not just the latest of either.
  3. So this is followed by a groupBy(...).agg(last(...),...). But I'm stuck on getting just one row with the latest values.
    from pyspark.sql.functions import get_json_object, last

    dfKafka1 = (spark.readStream.format("kafka")     # remaining options etc
                .load()
                .select(...))                        # pulls out fields as columns

    dfKafka2 = (spark.readStream.format("kafka")     # remaining options etc
                .load()
                .select(...))                        # pulls out fields as columns

    dfResult = dfKafka1.join(dfKafka2, "SecurityCol")

    # structured streaming doesn't yet allow groupBy after a join,
    # so write to an intermediate kafka topic
    (dfResult.writeStream.format("kafka")            # remaining options, incl. checkpointLocation
             .trigger(processingTime="1 second")
             .start())

    # load the intermediate kafka topic
    dfKafkaResult = (spark.readStream.format("kafka")  # remaining options
                     .load()
                     .select(...)                       # get_json_object for cols
                     .groupBy("SecurityCol")            # define the "key" to agg cols
                     .agg(last("PriceCol"),             # most recent value per col
                          last("PriceSeqNoCol"),
                          last("VolumeCol"),
                          last("VolumeSeqNoCol")))


Problem

However, the final agg with last() doesn't do the trick consistently.

  1. When KafkaVolumeTopic gets a new message, the result might join it with an older message from KafkaPriceTopic rather than the latest one (a small static illustration follows this list).
  2. Further, orderBy/sort can't be used on a stream without an aggregation, so I can't force the ordering that last() would need.
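
For point 1, the order dependence is easy to reproduce even on static data, since last() just takes whatever row happens to come last within the group rather than the highest sequence number:

    from pyspark.sql import functions as F

    # the "newer" price (seq no 2) happens to come before the older one here
    df = spark.createDataFrame(
        [("Sec1", 2, 101.6), ("Sec1", 1, 101.5)],
        ["SecurityCol", "PriceSeqNoCol", "PriceCol"])
    df.groupBy("SecurityCol").agg(F.last("PriceCol")).show()
    # can return 101.5 (seq no 1), even though seq no 2 is the latest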

Restrictions

  1. I can't groupBy before the join since that would require withWatermark, and I think my app can't use watermarks. Rationale:
    • The app should be able to join the two topics for a given SecurityCol at any time during the day.
      • If PriceTopic gets a message at 9am and VolumeTopic at 10am,
      • I'd expect the two to be joined and present.
    • A watermark restricts when data is emitted in append mode, so I can't use a watermark here since the timeframe is the whole day (see the sketch after this list).
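
For completeness, the shape of the watermarked aggregation I'd apparently need before the join is sketched below. It assumes the Kafka source's timestamp column is kept through the select (the delay/window values are only illustrative); since append mode only emits an aggregated row once the watermark passes the end of its window, a whole-day window means results would surface hours too late for the sub-second latency requirement:

    from pyspark.sql.functions import window, last

    # pre-join aggregation sketch: needs an event-time column plus watermark;
    # in append mode a group's row is emitted only after the watermark passes
    # the end of its window
    latestPrice = (dfKafka1
                   .withWatermark("timestamp", "1 hour")
                   .groupBy(window("timestamp", "1 day"), "SecurityCol")
                   .agg(last("PriceCol").alias("PriceCol"),
                        last("PriceSeqNoCol").alias("PriceSeqNoCol")))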

Any ideas?

Venki