I am trying to understand the internals of Spark Streaming (not Structured Streaming), specifically how tasks see a DStream. I am going over the Spark source code in Scala, here. I understand the call stack:
CoarseGrainedExecutorBackend (main) -> Executor (launchTask) -> TaskRunner (Runnable).run() -> Task.run(...)
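For concreteness, here is a toy sketch of that launch path. This is my own simplified model, not Spark's actual classes (names like SimpleExecutor are hypothetical): the backend hands a task to the executor, which wraps it in a Runnable TaskRunner and submits it to a thread pool, and TaskRunner.run() ultimately calls the task's run method.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, Future}

// Simplified model (not Spark's actual classes) of the launch path above:
// the backend hands a task to the Executor, which wraps it in a Runnable
// TaskRunner and submits it to a thread pool.
trait Task { def run(taskId: Long): String }

class TaskRunner(val taskId: Long, task: Task) extends Runnable {
  @volatile var result: String = null
  // Mirrors TaskRunner.run() calling task.run(...)
  override def run(): Unit = { result = task.run(taskId) }
}

class SimpleExecutor {
  private val threadPool = Executors.newFixedThreadPool(4)
  val runningTasks = new ConcurrentHashMap[Long, TaskRunner]()

  // Mirrors Executor.launchTask: register the runner, hand it to the pool.
  def launchTask(taskId: Long, task: Task): Future[_] = {
    val runner = new TaskRunner(taskId, task)
    runningTasks.put(taskId, runner)
    threadPool.submit(runner)
  }

  def shutdown(): Unit = threadPool.shutdown()
}
```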
I understand that a DStream is really a HashMap of RDDs keyed by batch time, but I am trying to understand the way tasks see the DStream. I know there are basically two approaches to Kafka-Spark integration:
Receiver-based, using the high-level Kafka consumer API
Here the Receiver task creates a new (micro-)batch at every batch interval (say 5 secs) with, say, 5 partitions (implying a 1 sec block interval) and hands it downstream to regular tasks.
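A toy model of this, in plain Scala with no Spark (all names here are mine): the DStream keeps a HashMap from batch time to the RDD generated for that batch, and in the receiver-based approach each block interval yields one block, which becomes one partition of the batch RDD, so partitions per batch = batch interval / block interval. A Vector of Lists stands in for a partitioned RDD.

```scala
import scala.collection.mutable

// Toy model (plain Scala, no Spark) of the bookkeeping described above.
class ToyReceiverDStream(batchIntervalMs: Long, blockIntervalMs: Long) {
  require(blockIntervalMs > 0 && batchIntervalMs % blockIntervalMs == 0,
    "batch interval should be a multiple of the block interval")

  // Stands in for DStream's HashMap from batch time to that batch's RDD.
  val generatedRDDs = mutable.HashMap.empty[Long, Vector[List[String]]]

  // One block per block interval, one partition per block.
  val partitionsPerBatch: Int = (batchIntervalMs / blockIntervalMs).toInt

  // Return the cached batch RDD for this time, or build and remember it.
  def getOrCompute(time: Long): Vector[List[String]] =
    generatedRDDs.getOrElseUpdate(
      time,
      Vector.tabulate(partitionsPerBatch)(p => List(s"block-$time-$p")))
}
```

With a 5000 ms batch interval and 1000 ms block interval this yields 5 partitions per micro-batch, matching the example above.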
Question: In our example, where a micro-batch is created every 5 secs with exactly 5 partitions, and all the partitions of all the micro-batches are DAG-ged downstream in exactly the same way, is the same regular task reused over and over for the same partition id of every micro-batch (RDD), as a long-running task? e.g.
If ubatch1 with partitions (P1, P2, P3, P4, P5) at time T0 is assigned to task ids (T1, T2, T3, T4, T5), will ubatch2 with partitions (P1', P2', P3', P4', P5') at time T5 also be assigned to the same set of tasks (T1, T2, T3, T4, T5), or will new tasks (T6, T7, T8, T9, T10) be created for ubatch2?
If the latter is the case, wouldn't it be expensive to serialize and send new tasks over the network to the executors every 5 seconds, when you already know there are tasks doing exactly the same thing that could be reused as long-running tasks?
Direct, using the low-level Kafka consumer API
Here each Kafka partition maps to a Spark partition, and therefore to a task. Again, assuming 5 Kafka partitions for a topic t, we get 5 Spark partitions and their corresponding tasks.
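The 1:1 mapping can be sketched like this. OffsetRange is a real concept in the direct Kafka integration (each batch RDD partition is defined by an offset range on one Kafka partition), but the code below is my own plain-Scala stand-in, not the spark-streaming-kafka API:

```scala
// Plain-Scala stand-in for the direct approach: each partition of a
// micro-batch RDD is an offset range on exactly one Kafka partition,
// so Kafka partitions map 1:1 to Spark partitions (and hence tasks).
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

object DirectBatch {
  // Plan one micro-batch: one Spark partition per Kafka partition,
  // covering the offsets consumed since the previous batch.
  def plan(topic: String,
           from: Map[Int, Long],
           until: Map[Int, Long]): Seq[OffsetRange] =
    from.keys.toSeq.sorted.map { p =>
      OffsetRange(topic, p, from(p), until(p))
    }
}
```

With 5 Kafka partitions, plan produces exactly 5 offset ranges, i.e. 5 Spark partitions for that micro-batch.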
Question: Say ubatch1 at T0 has partitions (P1, P2, P3, P4, P5) assigned to tasks (T1, T2, T3, T4, T5). Will ubatch2 with partitions (P1', P2', P3', P4', P5') at time T5 also be assigned to the same set of tasks (T1, T2, T3, T4, T5), or will new tasks (T6, T7, T8, T9, T10) be created for ubatch2?