This might be a stupid question, but I can't seem to find any doc that clarifies this in plain English, and after reading the official docs and some blog posts, I'm still confused about how the driver and executors work.
Here is my current understanding:
1) The driver defines the transformations/computations.
2) Once we call StreamingContext.start(), the driver sends the defined transformations/computations to all the executors, so that each executor knows how to process the incoming stream data (roughly as in the sketch below).
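To make sure we are talking about the same thing, here is a minimal driver sketch in the spirit of the standard word-count example. I'm assuming a socket text source on localhost:9999 and a local master just for illustration; the names are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Driver side: define the DStream transformations.
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    // My understanding: only after start() is the defined computation
    // shipped out and run against each batch of the stream.
    ssc.start()
    ssc.awaitTermination()
  }
}
```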
OK, here are the questions that are confusing me:
1) Does the driver send the defined transformations/computations to all executors only ONCE AND FOR ALL?
If that's the case, we would have no way to redefine/change the computation afterwards, right?
For example, I want to do a word-count job similar to this one, but my job is a little more complicated: I want to count only the words starting with the letter J for the first 60s, then only the words starting with the letter K for the next 60s, then only the words starting with the next letter, and so on.
So how am I supposed to implement this kind of streaming job in the driver?
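To be concrete, I know how to write the fixed-letter version, something like the sketch below (again assuming a socket source; the hard-coded "J" is exactly the part I don't know how to change once the job is running):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FilteredWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // The letter is fixed here, when the DStream graph is defined, before start().
    // What I don't see is how to make it "J" for the first 60s, "K" for the
    // next 60s, and so on, after the job has already started.
    val counts = words.filter(_.startsWith("J"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```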
2) Or does the driver restart/reschedule all the executors after each batch of data is processed?
FOLLOWUP
To solve question 1), I think I could make use of some external storage such as Redis: I could implement a processing function count_fn in the driver, and each time count_fn is called, it would read the current starting letter from Redis and then do the count on the RDDs in the stream. Is this the right way to go?
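Here is a rough sketch of what I have in mind, assuming the Jedis client and a Redis key that I'm calling current_letter purely for illustration (some other process would rotate the letter every 60s):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

object RedisLetterCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RedisLetterCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // count_fn: runs on the driver once per batch, so it can pick up a new
    // letter from Redis between batches. "current_letter" is a key name I made up.
    def countFn(rdd: RDD[String]): Unit = {
      val jedis = new Jedis("localhost")
      val letter = Option(jedis.get("current_letter")).getOrElse("J")
      jedis.close()

      // Only the plain String `letter` is captured by the executor-side closure.
      val counts = rdd.filter(_.startsWith(letter))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.collect().foreach(println)
    }

    words.foreachRDD(rdd => countFn(rdd))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

My assumption here is that the body of the function passed to foreachRDD (apart from the RDD operations themselves) runs on the driver for every batch, which is exactly the part of my mental model I'm not sure about.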