This might be a stupid question, but I can't seem to find any doc that clarifies this in plain English, and after reading the official docs and some blog posts, I'm still confused about how the driver and executors work.
Here is my current understanding:
1) The driver defines the transformations/computations.
2) Once we call StreamingContext.start(), the driver sends the defined transformations/computations to all the executors, so that each executor knows how to process the incoming stream data (roughly as in the sketch below).
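To make sure we are talking about the same thing, here is a minimal driver sketch in the spirit of the standard word-count example. I'm assuming a socket text source on localhost:9999 and a local master just for illustration; the names are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Driver side: define the DStream transformations.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    // My understanding: only after start() is the defined computation
    // shipped out and run against each batch of the stream.
    ssc.start()
    ssc.awaitTermination()
  }
}
```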
OK, here are the questions that are confusing me:
1) Does the driver send the defined transformations/computations to all executors only ONCE AND FOR ALL?
If that's the case, we would have no way to redefine/change the computation afterwards, right?
For example, I want to do a word-count job similar to this one, but my job is a little more complicated: I want to count only the words starting with the letter J for the first 60s, then only the words starting with the letter K for the next 60s, then only the words starting with the next letter, and so on.
So how am I supposed to implement this kind of streaming job in the driver?
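To be concrete, I know how to write the fixed-letter version, something like the sketch below (again assuming a socket source; the hard-coded "J" is exactly the part I don't know how to change once the job is running):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FilteredWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // The letter is fixed here, when the DStream graph is defined, before start().
    // What I don't see is how to make it "J" for the first 60s, "K" for the
    // next 60s, and so on, after the job has already started.
    val counts = words.filter(_.startsWith("J"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```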
2) Or does the driver restart/reschedule all the executors after each batch of data is processed?
FOLLOWUP
To solve question 1), I think I could make use of some external storage such as Redis: I could implement a processing function count_fn in the driver, and each time count_fn is called, it would read the current starting letter from Redis and then do the count on the RDDs in the stream. Is this the right way to go?
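Here is a rough sketch of what I have in mind, assuming the Jedis client and a Redis key that I'm calling current_letter purely for illustration (some other process would rotate the letter every 60s):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

object RedisLetterCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RedisLetterCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // count_fn: runs on the driver once per batch, so it can pick up a new
    // letter from Redis between batches. "current_letter" is a key name I made up.
    def countFn(rdd: RDD[String]): Unit = {
      val jedis = new Jedis("localhost")
      val letter = Option(jedis.get("current_letter")).getOrElse("J")
      jedis.close()

      // Only the plain String `letter` is captured by the executor-side closure.
      val counts = rdd.filter(_.startsWith(letter))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.collect().foreach(println)
    }

    words.foreachRDD(rdd => countFn(rdd))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

My assumption here is that the body of the function passed to foreachRDD (apart from the RDD operations themselves) runs on the driver for every batch, which is exactly the part of my mental model I'm not sure about.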