
[Spark UI screenshot showing the growing number of tasks]

I wonder why there are so many tasks in my Spark Streaming job, and why the number keeps getting bigger and bigger.

After 3.2 hours of running it has grown to 120020, and after one day of running it will grow to one million or more... why?

user2848932

3 Answers


I would strongly recommend that you check the parameter spark.streaming.blockInterval, which is a very important one. By default it is 200 ms, i.e. a new block (and hence a new task per receiver) is created every 200 ms.

So maybe you can try increasing spark.streaming.blockInterval (while keeping it no larger than your batch interval); the number of tasks should then decrease.
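
For reference, a minimal sketch of where that setting goes (the app name, batch interval, and the 2-second value are assumptions, not taken from the question). Note that spark.streaming.blockInterval only affects receiver-based input streams, and older Spark releases expect the value as a plain number of milliseconds:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A larger block interval means fewer blocks per batch,
// and therefore fewer tasks per receiver per batch.
val conf = new SparkConf()
  .setAppName("streaming-job")                      // hypothetical app name
  .set("spark.streaming.blockInterval", "2000ms")   // default is 200ms
val ssc = new StreamingContext(conf, Seconds(120))  // 2-minute batches, as mentioned in the comments
```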

My intuition is simply that your consumer is not keeping up with the producer, so as time goes on, more and more tasks accumulate while they wait to be processed.

It may be due to your Spark cluster not being able to process such a large batch. It may also be related to the checkpoint interval; perhaps you are setting it too large or too small. It could also be related to your settings for parallelism, partitions, data locality, etc.
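
A rough, hedged sketch of those knobs (the checkpoint directory, source, and numbers below are made up for illustration, not taken from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-job")
val ssc = new StreamingContext(conf, Seconds(120))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // hypothetical checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)  // stand-in source
lines.checkpoint(Seconds(600))                       // checkpoint every 5 batches to keep the lineage short
val words = lines.repartition(8)                     // spread work over more partitions/cores
  .flatMap(_.split(" "))
words.count().print()

ssc.start()
ssc.awaitTermination()
```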

good luck

Read these:

  • Tuning Spark Streaming for Throughput
  • How-to: Tune Your Apache Spark Jobs (Part 1)
  • How-to: Tune Your Apache Spark Jobs (Part 2)

keypoint
  • Hello, I think what is not normal is that the number of "skipped" tasks keeps growing. The number of tasks that actually run is constant, but the "skipped" count gets larger and larger. I could not understand which tasks are skipped … – user2848932 May 15 '15 at 07:56
  • I had the same problem, but I'm not sure about its cause; maybe not enough memory or an RDD split error. I'm not sure... – keypoint May 15 '15 at 08:02
  • I guess it may depend on the RDD lineage: the new streaming tasks' lineage is never cut off, and the "skipped" tasks are the earlier part of that lineage, so the tasks remembered by each new task keep growing. Once a new task fails, all of the original "skipped" tasks will be executed again. – user2848932 May 15 '15 at 08:04
  • So I think maybe checkpoint() will solve the problem. Do you know where I should add the checkpoint? – user2848932 May 15 '15 at 08:04
  • Sorry, I don't know; I'm struggling with checkpointing too. My current checkpoint setting is 1 second, and I'm trying to optimize it :P – keypoint May 15 '15 at 08:11

This Spark UI feature means that some stage dependencies might have been computed, or might not, but either way they were skipped because their output was already available (typically shuffle files left by an earlier job over the same lineage). Therefore they show as skipped [1].

Please note the "might": until the job finishes, Spark doesn't know for sure whether it will need to go back and re-compute some of the stages that were initially skipped.
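
A toy illustration of the same idea (a standalone sketch, not the asker's job): two actions share one shuffle dependency, so the second job can reuse the shuffle files written by the first, and its map-side stage shows up as "skipped" in the UI:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("skipped-stage-demo").setMaster("local[*]"))

val counts = sc.parallelize(1 to 1000000)
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)

counts.count()   // job 1: runs the map stage and writes its shuffle output
counts.collect() // job 2: that map stage is listed as "skipped" because its output already exists
```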

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala#L189

ssedano

The nature of a streaming application is to run the same process for each batch of data over time. It looks like you're trying to run with a 1-second batch interval, and each interval might spawn several jobs. You show 585 jobs in 3.2 hours, not 120020. However, it also kind of looks like your processing finishes in nowhere near 1 second. I imagine your scheduling delay is very, very high. This is a symptom of having far too small a batch interval, I would guess.
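
One way to check that (a hedged sketch; the listener and log format are my own illustration, not part of this answer) is to attach a StreamingListener and watch whether the scheduling delay keeps growing batch after batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

val ssc = new StreamingContext(new SparkConf().setAppName("streaming-job"), Seconds(1))

// Log each completed batch's scheduling delay and processing time; a steadily
// growing scheduling delay means batches arrive faster than they are processed.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch ${info.batchTime}: scheduling delay = ${info.schedulingDelay.getOrElse(-1L)} ms, " +
            s"processing time = ${info.processingDelay.getOrElse(-1L)} ms")
  }
})
```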

Sean Owen
  • The time interval is 2 minutes, and I think my job runs normally. – user2848932 May 11 '15 at 02:35
  • And when the job begins, the "all tasks" number is not that large, only a few hundred. After 3.2 hours it grows to 120020, and after half a day it grows to 1,000,000+. – user2848932 May 11 '15 at 02:38
  • I cannot understand what the "skipped tasks" are in a Spark Streaming job, why they are skipped, and why the number of skipped tasks keeps growing. – user2848932 May 15 '15 at 07:58