I am using Spark 2.3 Structured Streaming with Kafka as the input stream. My cluster consists of a master and 3 workers (the master runs on one of the worker machines). My Kafka topic has 3 partitions, matching the number of workers. I am using the default trigger and a foreach sink to process the data.
When the first message arrives at the driver, processing starts immediately on one of the available worker nodes. While it is being processed, a second message arrives; instead of immediately being processed on an available worker, its "execution" is delayed until the first worker finishes. Only then do all of the "waiting executions" start processing in parallel on all the available workers (say I have 3 waiting messages).
How can I force the execution to start immediately on an idle worker?
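For context, kafkaStreamSubscriber.buildtream is a small helper of mine that is not shown below; it roughly wraps the standard Kafka source like this (just a sketch, the broker address and the kafkaDf name are placeholders):

// Sketch of the Kafka source wrapped by buildtream (placeholder broker address)
val kafkaDf = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", kafkaInputTopic)              // topic with 3 partitions
  .load()

No .trigger(...) is set on the writeStream side, so the default trigger applies.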
**A snippet of my code:**
import org.apache.spark.sql.{ForeachWriter, SparkSession}

val sparkSession = SparkSession.builder().config(conf).getOrCreate()
import sparkSession.implicits._
val writer = new ForeachWriter[String] {
  override def open(partitionId: Long, version: Long) = true
  override def process(filePath: String) = {
    val filesSeq = fileHandler
      .handleData(filePath) // long processing
  }
  override def close(errorOrNull: Throwable) = {}
}
val filesDf = kafkaStreamSubscriber
  .buildtream(conf, kafkaInputTopic)
val ds = filesDf.map(x => x.getAs("filePath").asInstanceOf[String])
val query = ds.writeStream
  .foreach(writer)
  .start()

ds.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
println("lets go....")
query.awaitTermination()
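One note on the snippet above: two streaming queries are started, but only the foreach query is awaited. If waiting on both queries matters, the session-level call below would cover them (standard Spark API, shown here only for completeness):

// Wait for any started streaming query to terminate, instead of only the foreach one
sparkSession.streams.awaitAnyTermination()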
What am I doing wrong? I don't want idle workers while there is data waiting to be processed.
Thanks