
I am trying to understand the internals of Spark Streaming (not Structured Streaming), specifically the way tasks see the DStream. I am going over the source code of Spark in Scala, here. I understand the call stack:

CoarseGrainedExecutorBackend (main) -> Executor (launchTask) -> TaskRunner (Runnable).run() -> Task.run(...) 

I understand that a DStream is really a HashMap of RDDs (keyed by batch time), but I am trying to understand the way tasks see the DStream. I know that there are basically two approaches to Kafka Spark integration (a minimal setup sketch for both follows the list below):

  • Receiver based using High Level Kafka Consumer APIs

    Here a new (micro-)batch is created every batch interval (say 5 s) with, say, 5 partitions (i.e. a 1 s block interval) by the Receiver task and handed downstream to regular tasks.

    Question: Considering our example where a micro-batch is created every 5 s, has exactly 5 partitions, and all these partitions of all the micro-batches are DAG-ged downstream in the exact same way: is the same regular task re-used over and over again for the same partition id of every micro-batch (RDD), as a long-running task? e.g.

    If ubatch1 of partitions (P1,P2,P3,P4,P5) at time T0 is assigned to task ids (T1, T2, T3, T4, T5), will ubatch2 of partitions (P1',P2',P3',P4',P5') at time T5 also be assigned to the same set of tasks (T1, T2, T3, T4, T5), or will new tasks (T6, T7, T8, T9, T10) be created for ubatch2?

    If the latter is the case, wouldn't it be expensive to send new tasks over the network to executors every 5 seconds when you already know there are tasks doing the exact same thing that could be re-used as long-running tasks?

  • Direct using Low Level Kafka Consumer APIs

    Here a Kafka Partition maps to a Spark Partition and therefore a Task. Again, considering 5 Kafka partitions for a topic t, we get 5 Spark partitions and their corresponding tasks.

    Question: Say ubatch1 at T0 has partitions (P1,P2,P3,P4,P5) assigned to tasks (T1, T2, T3, T4, T5). Will ubatch2 of partitions (P1',P2',P3',P4',P5') at time T5 also be assigned to the same set of tasks (T1, T2, T3, T4, T5), or will new tasks (T6, T7, T8, T9, T10) be created for ubatch2?
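
For concreteness, here is roughly how I am wiring up the two approaches (a sketch using the older spark-streaming-kafka 0-8 artifact; the topic t, ZooKeeper/broker addresses and group id are just placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaIntegrationSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-dstream-sketch").setMaster("local[*]")
        // 5-second batch interval, as in the example above
        val ssc = new StreamingContext(conf, Seconds(5))

        // Receiver-based: high-level consumer API. Blocks are cut every
        // spark.streaming.blockInterval (assumed set to 1 s here), so a 5 s
        // batch ends up with 5 partitions.
        val receiverStream = KafkaUtils.createStream(
          ssc,
          "zkhost:2181",        // ZooKeeper quorum (placeholder)
          "my-consumer-group",  // consumer group id (placeholder)
          Map("t" -> 1))        // topic -> number of receiver threads

        // Direct: low-level consumer API. One Spark partition per Kafka
        // partition, so 5 Kafka partitions of topic t => 5 Spark partitions.
        val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc,
          Map("metadata.broker.list" -> "broker:9092"),  // broker list (placeholder)
          Set("t"))

        receiverStream.map(_._2).count().print()
        directStream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }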

  • That, I agree. I am trying to get insight into this to be able to understand the entire picture bottom-up. I want to write one myself, a simplistic one; it will help me understand lots of aspects. – Sheel Pancholi May 12 '19 at 19:13
  • Very noble, a bit like myself, but there is also Structured Streaming which, while not entirely there yet, makes life a lot easier. There are only so many hours in a day! – thebluephantom May 12 '19 at 19:14
  • Anyone who could help me understand this? There is no problem moving from the Receiver-based to the Direct-based approach, or even to Structured Streaming for that matter. It is just that I am trying to understand the inner details of each to be able to better appreciate the evolution timeline, the existing problems and the approach taken. – Sheel Pancholi May 13 '19 at 05:39

1 Answer


After going over the source code of Apache Spark, here is the definitive answer:

It's a pretty intuitive approach.

  1. We use the StreamingContext (ssc), built on top of the SparkContext, to create and save our sequence of transformations on the stream in the form of a DStream DAG ending at a ForEachDStream, where each DStream is a container of RDDs, i.e. a HashMap of batch time -> RDD.
  2. The ForEachDStream is registered with the DStreamGraph of the ssc.
  3. On ssc.start(), the JobScheduler puts our saved plan on an event loop that fires every batch interval, creating/extracting an RDD from each DStream for that batch time and saving it in the corresponding DStream's HashMap for a rememberDuration period of time (e.g. for windowing),
  4. and in the process builds the RDD DAG ending in the action specified in the ForEachDStream, which then submits a new job to the DAGScheduler.

This cycle repeats every batch interval, so each micro-batch results in a fresh RDD DAG, a fresh job, and hence a fresh set of tasks, rather than long-running tasks being re-used. A short sketch tying these steps back to user code follows.
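
Here is a minimal user-side sketch (the socket source, port and durations are placeholders I picked for illustration); the foreachRDD call is what creates and registers the ForEachDStream, and ssc.start() kicks off the per-batch job generation:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamDagSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dstream-dag-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5 s micro-batch interval
        ssc.remember(Seconds(60))                        // rememberDuration: keep generated RDDs for 60 s

        // Step 1: these transformations only build the DStream DAG; nothing runs yet.
        val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

        // Step 2: foreachRDD creates a ForEachDStream and registers it with
        // the ssc's DStreamGraph as an output stream.
        counts.foreachRDD { (rdd, time) =>
          // Step 4: `rdd` is the RDD generated for the micro-batch at `time`;
          // this action submits a job for it to the DAGScheduler.
          println(s"batch $time -> ${rdd.count()} distinct words")
        }

        // Step 3: start() kicks off the JobScheduler's event loop, which
        // generates jobs every 5 seconds; each DStream caches the RDD it
        // generated for `time` in its HashMap of time -> RDD.
        ssc.start()
        ssc.awaitTermination()
      }
    }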
