
I am using Spark 2.3 Structured Streaming with Kafka as the input stream. My cluster consists of a master and 3 workers (the master runs on one of the worker machines). My Kafka topic has 3 partitions, the same as the number of workers. I am using the default trigger and a foreach sink to process the data.

When the first message arrives at the driver, processing starts immediately on one of the available worker nodes. While it is being processed, a second message arrives. Instead of immediately starting to process it on an available worker, its "execution" is delayed until the first worker finishes processing; only then do all of the "waiting executions" start processing in parallel on all the available workers (let's say I have 3 waiting messages).

How can I force execution to start immediately on an idle worker?

**A snippet of my code:**

val sparkSession = SparkSession.builder().config(conf).getOrCreate()
import sparkSession.implicits._

import org.apache.spark.sql.ForeachWriter

val writer = new ForeachWriter[String] {
  // no per-partition setup needed, always accept the partition
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(filePath: String): Unit = {
    val filesSeq = fileHandler
      .handleData(filePath) // long processing
  }
  override def close(errorOrNull: Throwable): Unit = {}
}

val filesDf = kafkaStreamSubscriber
  .buildtream(conf, kafkaInputTopic)

val ds = filesDf.map(x => x.getAs[String]("filePath"))


val query =
  ds.writeStream
    .foreach(writer)
    .start()

ds.writeStream
  .format("console")
  .option("truncate", "false")
  .start()

println("lets go....")

query.awaitTermination()
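For context, `kafkaStreamSubscriber.buildtream` wraps something like the standard Kafka source (the broker address below is illustrative):

// Roughly what buildtream does internally (broker address is illustrative)
sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", kafkaInputTopic)
  .load()
  .selectExpr("CAST(value AS STRING) AS filePath") // the message value carries the file path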

What am I doing wrong? I don't want to have idle workers while there is data waiting to be processed.

Thanx

D. bachar

1 Answer


Refer to the [Triggers](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) section of the Spark Structured Streaming documentation.

As far as I understand, the default trigger processes one micro-batch at a time. I would suggest considering the experimental Continuous Processing mode if you need to process data as soon as it arrives.
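For example, switching the query to the continuous trigger would look roughly like this (a sketch only; in Spark 2.3 continuous processing is experimental and supports just a limited set of operations and sinks, so check the documentation before relying on it with a foreach sink):

import org.apache.spark.sql.streaming.Trigger

// Sketch: experimental continuous processing, shown here with the console sink.
// The "1 second" value is the checkpoint interval, not a batch interval.
ds.writeStream
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.Continuous("1 second"))
  .start()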

My understanding is that if you use a trigger of, let's say, 5 seconds, the micro-batch will read messages from all 3 partitions and you will have 3 tasks running at the same time. Until they have all finished, no new micro-batch will be started.
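For example, setting a fixed 5-second micro-batch trigger on the query from your question would look like this (a sketch reusing your `ds` and `writer`):

import org.apache.spark.sql.streaming.Trigger

// Sketch: micro-batches fire at most every 5 seconds; each batch reads the new
// offsets from all 3 Kafka partitions, so up to 3 foreach tasks run in parallel,
// and the next micro-batch starts only after all of them have finished.
val query =
  ds.writeStream
    .foreach(writer)
    .trigger(Trigger.ProcessingTime("5 seconds"))
    .start()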

Hope it helps!

Mikhail Dubkov
  • Thank you @Mikhail, but referring to the [Spark Structured Streaming Triggers documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers), **"If the previous micro-batch takes longer than the interval to complete, then the next micro-batch will start as soon as the previous one completes"**: there are still free resources on my cluster. I am looking for behavior more like message-driven events: when a message arrives, if there is an available executor, it should start processing as soon as possible. – D. bachar Aug 16 '18 at 08:40
  • In general, Spark is mostly micro-batch based; you may try the experimental continuous mode or another event-driven framework such as Storm. – Mikhail Dubkov Aug 16 '18 at 17:04