
I have a very strange issue with Spark Structured Streaming. It creates two Spark jobs for every micro-batch and, as a result, reads the data from Kafka twice. Here is a simple code snippet.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

object CheckHowSparkReadFromKafka {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .config(new SparkConf()
        .setAppName(s"simple read from kafka with repartition")
        .setMaster("local[*]")
        .set("spark.driver.host", "localhost"))
      .getOrCreate()
    val testPath = "/tmp/spark-test"
    FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new Path(testPath), true)
    import session.implicits._
    val stream = session
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers",        "kafka-20002-prod:9092")
      .option("subscribe", "topic")
      .option("maxOffsetsPerTrigger", 1000)
      .option("failOnDataLoss", false)
      .option("startingOffsets", "latest")
      .load()
      .repartitionByRange( $"offset")
      .writeStream
      .option("path", testPath + "/data")
      .option("checkpointLocation", testPath + "/checkpoint")
      .format("parquet")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .start()
    stream.processAllAvailable()
  }
}

This happens because of .repartitionByRange($"offset"): if I remove this line, everything is fine. But with it, Spark creates two jobs: one with a single stage that just reads from Kafka, and a second with three stages (read -> shuffle -> write). So the result of the first job is never used.

This has a significant impact on performance. Some of my Kafka topics have 1550 partitions, so reading them twice is a big deal. If I add a cache, things get better (see the sketch below), but this is not an option for me. In local mode, the first job of each batch takes less than 0.1 ms, except for the batch with index 0. But on a YARN cluster and on Mesos, both jobs are fully executed and on my topics take nearly 1.2 min each.
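For reference, here is a rough sketch of the cache workaround using foreachBatch, so that each micro-batch is persisted before the range repartition. It reuses session, testPath and the implicits from the snippet above; this is only an illustration of the idea, not my production code.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// session, testPath and import session.implicits._ are the same as in the snippet above.
val stream = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-20002-prod:9092")
  .option("subscribe", "topic")
  .option("maxOffsetsPerTrigger", 1000)
  .option("failOnDataLoss", false)
  .option("startingOffsets", "latest")
  .load()
  .writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Persist the micro-batch so that both passes of repartitionByRange
    // (sampling and the actual shuffle) read from the cache, not from Kafka.
    val cached = batch.persist()
    cached
      .repartitionByRange($"offset")
      .write
      .mode("append")
      .parquet(testPath + "/data")
    cached.unpersist()
  }
  .option("checkpointLocation", testPath + "/checkpoint")
  .trigger(Trigger.ProcessingTime(10.seconds))
  .start()

With this, the sampling pass reads the persisted micro-batch instead of going back to Kafka, but it keeps the whole batch in memory/disk, which is exactly what I want to avoid.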

Why does this happen? How can I avoid it? Does it look like a bug?

P.S. I use Spark 2.4.3.

Grigoriev Nick
  • Can you see the number of tasks that the write stage creates and the shuffle write/read? – Emiliano Martinez Apr 10 '20 at 08:51
  • @EmiCareOfCell44 Yes, I can see all task numbers and stages. The first job has 1 stage with 240 tasks (the same as the number of partitions in Kafka). The second job has 2 stages, where the first is similar to the first job's stage and the second one is shuffle + write. – Grigoriev Nick Apr 10 '20 at 09:51
  • I suppose that for the range partitioner Spark must read all messages first in order to create the RangePartitioner. Because the offset is an unknown number in each partition, it reads and reallocates all messages into one partition to create the index, and then shuffles the data to be processed by each executor. Is the range partitioner mandatory for your case? – Emiliano Martinez Apr 10 '20 at 10:10
  • 1. Yes, the RangePartitioner is mandatory. 2. Spark can't reuse the result of a job in another job without persisting it to some sink and reading it again. And most importantly, the first stage of the second job is exactly the same as the only stage of the first one. Even more, if I use direct streaming, create a DataFrame from every micro-batch read, and then apply the same logic, I get 1 job. – Grigoriev Nick Apr 10 '20 at 10:16
  • The same happens when I use a simple sort. – Grigoriev Nick Mar 13 '21 at 09:48

1 Answer


There is no bug in Spark in this case. The root cause of reading the data from Kafka twice is very simple: the repartitionByRange function generates two Spark jobs per micro-batch.

One for the actual repartitioning.

One for sampling the data to find the borders of the range partitions.

Please find more details in the Spark JIRA.
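If it helps to see this outside of streaming, here is a minimal batch sketch (the object name, path and row count are made up) that shows the same behaviour in the Spark UI: the single write action produces two jobs, the first being the sampling pass that picks the range borders, the second doing the shuffle and write.

import org.apache.spark.sql.SparkSession

object RepartitionByRangeJobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("repartitionByRange job count")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    spark.range(0, 1000000)          // any source works; a generated range is enough
      .repartitionByRange(8, $"id")  // first job: sample the data to pick range borders
      .write                         // second job: shuffle and write
      .mode("overwrite")
      .parquet("/tmp/repartition-by-range-demo")

    spark.stop()
  }
}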

Grigoriev Nick