I wonder why there are so many tasks in my Spark Streaming job, and why the number keeps getting bigger and bigger.
After 3.2 hours of running it has grown to 120,020, and after one day of running it will reach one million... why?
I would strongly recommend that you check the parameter spark.streaming.blockInterval, which is a very important one. By default it is 200 ms, i.e. one block (and hence one task per receiver) is created every 200 ms.
So you could try increasing spark.streaming.blockInterval (for example to 1 min or even 10 min); the number of tasks should then decrease.
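A minimal sketch of how that setting is applied (the app name, block interval and batch interval below are illustrative assumptions, not values from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("block-interval-example")
  // One block (= one task per receiver per batch) is created every block
  // interval, so a larger value means fewer tasks per batch.
  .set("spark.streaming.blockInterval", "2s")

// 10-second batches: tasks per batch per receiver ≈ batchInterval / blockInterval = 10s / 2s = 5
val ssc = new StreamingContext(conf, Seconds(10))
```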
My intuition is simply that your consumer is not as fast as the producer, so as time goes on, more and more tasks accumulate waiting to be processed.
It may be due to your Spark cluster's inability to process such a large batch. It may also be related to the checkpoint interval; maybe you are setting it too large or too small. It could also be related to your settings for parallelism, partitions, data locality, etc.
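For illustration only, here is a hedged sketch of where those knobs live; the source, paths, parallelism and checkpoint interval are placeholder assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("tuning-example")
  // Default number of partitions used by shuffles when none is specified.
  .set("spark.default.parallelism", "8")

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoints")            // hypothetical checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical input source
// The checkpoint interval is usually a small multiple of the batch interval.
lines.checkpoint(Seconds(50))
// Repartition to spread work across the cluster before heavy processing.
lines.repartition(8).count().print()

ssc.start()
ssc.awaitTermination()
```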
Good luck.
Read this.
This Spark UI feature means that some stage dependencies might have been computed, or not, but were skipped because their output was already available; therefore they show up as skipped.
Please note the "might": until the job finishes, Spark doesn't know for sure whether it will need to go back and re-compute some stages that were initially skipped.
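A small standalone sketch (not taken from the question) of how skipped stages arise: the second action reuses the shuffle output of the first, so its map-side stage is reported as skipped in the Spark UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("skipped-stages").setMaster("local[2]"))

val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))
val reduced = pairs.reduceByKey(_ + _)   // introduces a shuffle boundary

reduced.count()   // job 1: computes both stages and writes shuffle files
reduced.count()   // job 2: the shuffle map stage is skipped, its output already exists
```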
The nature of a streaming application is to run the same process on each batch of data over time. It looks like you're running with a 1-second batch interval, and each interval may spawn several jobs. You show 585 jobs in 3.2 hours, not 120,020. However, it also looks like your processing finishes in nowhere near 1 second; I imagine your scheduling delay is very high. That is a symptom of having far too small a batch interval, I would guess.
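Illustrative only (the intervals below are assumptions): if each batch takes several seconds to process, a 1-second interval means batches queue up and the scheduling delay grows without bound, whereas a wider interval gives each batch room to finish before the next one starts.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("batch-interval-example")

// With Seconds(1) and ~5 s of processing per batch, work piles up;
// a 10-second interval keeps processing time below the batch interval.
val ssc = new StreamingContext(conf, Seconds(10))
```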