
I am using Spark Structured Streaming to read records from a Kafka topic; I intend to count the number of records received in each micro-batch of the Spark readStream.

This is a snippet:

val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .load()

I understand from the docs that kafka_df will be lazily evaluated when a streaming query is started (to come next), and that, as it is evaluated, it holds a micro-batch of records. So I figured that a groupBy on topic followed by a count should work.

Like this:

val counter = kafka_df
             .groupBy("topic")
             .count()

Now, to evaluate all of this, we need a streaming query, let's say a console sink query to print it on the console. And this is where I see the problem. A streaming query on an aggregated DataFrame, such as counter above, works only with outputMode complete/update and not with append.

This effectively means that the count reported by the streaming query is cumulative.

Like this:

 val counter_json = counter.toJSON   // to JSON-ify
 val count_query = counter_json
                   .writeStream.outputMode("update")
                   .format("console")
                   .start()             // kicks off lazy evaluation
                   .awaitTermination()

In a controlled set-up, where:
Actual published records: 1500
Actual received micro-batches: 3
Actual received records: 1500

The count for each micro-batch is supposed to be 500, so I hoped (wished) that the query would print to the console:

topic: test-count
count: 500
topic: test-count
count: 500
topic: test-count
count: 500

But it doesn't. It actually prints:

topic: test-count
count: 500
topic: test-count
count: 1000
topic: test-count
count: 1500

This, I understand, is because of outputMode complete/update (which are cumulative).

My question: Is it possible to accurately get the count of each micro-batch in Spark-Kafka Structured Streaming?

From the docs, I found out about the watermark approach (to support append):

val windowedCounts = kafka_df
                    .withWatermark("timestamp", "10 seconds")
                    .groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"topic")
                    .count()

 val console_query = windowedCounts
                    .writeStream
                    .outputMode("append")
                    .format("console")
                    .start()
                    .awaitTermination()

But the results of this console_query are inaccurate and appear to be way off the mark.

TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated.


2 Answers


If you want to process only a specific number of records with every trigger within a Structured Streaming application using Kafka, use the option maxOffsetsPerTrigger:

val kafka_df = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "test-count")
  .option("maxOffsetsPerTrigger", 500)
  .load()
  • I intend to count the number of records per batch and not limit it. This could also be extended to aggregating some numeric data (sourced from kafka) per batch. The problem for me here is in achieving 'append' mode, where 'aggregation' for each batch is sent to another Kafka Sink. Whereas, currently I observe that this aggregation is cumulative. – irrelevantUser Aug 14 '18 at 12:55
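For the per-batch aggregation described in the comment above, one option (assuming Spark 2.4+, where DataStreamWriter.foreachBatch is available) is to let Spark hand each micro-batch to a function as a plain DataFrame; a groupBy/count inside that function is then scoped to the current batch only. A minimal sketch, reusing kafka_df from the question:

import org.apache.spark.sql.DataFrame

// Sketch only (requires Spark 2.4+): foreachBatch passes every micro-batch
// to this function as a regular, non-streaming DataFrame, so the count below
// is per batch rather than cumulative.
val per_batch_query = kafka_df
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val batchCounts = batchDF.groupBy("topic").count()
    println(s"Micro-batch $batchId")
    batchCounts.show(false)
    // batchCounts could also be written out to a Kafka sink here
  }
  .start()

per_batch_query.awaitTermination()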

"TL;DR - Any thoughts on accurately counting the records in Spark-Kafka micro-batch would be appreciated."

You can count the records fetched from Kafka by using a StreamingQueryListener (ScalaDocs).

This allows you to print out the exact number of rows that were received from the subscribed Kafka topic. The onQueryProgress API gets called during every micro-batch and contains a lot of useful meta information about your query. If no data is flowing into the query, onQueryProgress is called every 10 seconds. Below is a simple example that prints out the number of input messages.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {}

    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {}

    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
      println("NumInputRows: " + queryProgress.progress.numInputRows)
    }
  })
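
One possible way to wire this up with the question's set-up (names such as kafka_df are taken from the question; the listener is typically registered before the query is started so that no progress events are missed) is to run a plain pass-through query and let the listener do the counting:

// No aggregation on the stream itself, so append mode is fine;
// the per-micro-batch count comes from the listener above.
val raw_query = kafka_df
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("console")
  .start()

raw_query.awaitTermination()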

In case you are validating the performance of your Structured Streaming query, it is usually best to keep an eye on the following two metrics:

  • queryProgress.progress.inputRowsPerSecond
  • queryProgress.progress.processedRowsPerSecond

If the input rate is higher than the processed rate, you might increase resources for your job or reduce the maximum limit (by lowering the readStream option maxOffsetsPerTrigger). If the processed rate is higher, you may want to increase that limit.
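
A minimal sketch of watching those two rates with the same listener API (the comparison and the log message are illustrative only; spark is assumed to be the active SparkSession):

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    // If rows arrive faster than they are processed, the query is falling behind
    // and either resources or maxOffsetsPerTrigger should be adjusted.
    if (p.inputRowsPerSecond > p.processedRowsPerSecond) {
      println(s"Falling behind: input=${p.inputRowsPerSecond} rows/s, processed=${p.processedRowsPerSecond} rows/s")
    }
  }
})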
