
I am currently exploring Spark's speculative tasks option.

Below is the configuration I am planning to use. I am reading data from Kafka, and by using repartition() I am creating around 200+ tasks in my streaming code.

  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "1000")
  .set("spark.speculation.multiplier", "2")
  .set("spark.speculation.quantile", "0.75")

Will the above speculative-task configuration have any impact on the overall performance of my streaming job? If so, are there any best practices for using Spark's speculative tasks option?

Shane

1 Answer


See https://www.youtube.com/watch?v=5RppAH780DM or https://databricks.com/session_na21/best-practices-for-enabling-speculative-execution-on-large-scale-platforms

Speculation is just that: it may help if there are one or more straggler tasks in a stage, based on calculations influenced by these settings. Whether it actually helps depends on your setup and runtime behaviour.
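
To make that concrete, here is a rough illustration of when speculation kicks in with the settings from the question. The task count matches the question, but the median runtime is an assumption, not a measurement:

  // Hypothetical numbers to show how quantile and multiplier interact.
  val numTasks   = 200    // roughly the task count mentioned in the question
  val quantile   = 0.75   // spark.speculation.quantile
  val multiplier = 2.0    // spark.speculation.multiplier

  // Spark only starts looking for stragglers once this many of the
  // stage's tasks have finished successfully.
  val tasksBeforeChecking = math.ceil(numTasks * quantile).toInt   // 150

  // Assume the median runtime of those finished tasks is 4 seconds.
  val medianRuntimeSec = 4.0

  // A still-running task becomes a candidate for a speculative copy once
  // it has run longer than multiplier * median.
  val speculationThresholdSec = medianRuntimeSec * multiplier      // 8.0 seconds

  println(s"Check after $tasksBeforeChecking tasks finish; speculate tasks running > $speculationThresholdSec s")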

I would use 0.9 instead of 0.75. I also always think about idempotency: a speculative copy means the same task may run more than once, so any side effects (for example writes to an external sink) need to be safe to repeat.
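
A minimal sketch of the adjusted settings; the property keys are standard Spark confs, but the values are just the suggestion above rather than tuned numbers, and the time suffix on the interval is the form newer Spark versions accept:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "1000ms") // how often the scheduler checks for stragglers
    .set("spark.speculation.multiplier", "2")    // a task must be 2x slower than the median to qualify
    .set("spark.speculation.quantile", "0.9")    // wait for 90% of tasks to finish before checking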

In short, it is also a question of experimentation; too much speculation can cause driver overhead and other excessive use of resources that may not be needed. So this question cannot really be answered definitively without trying it on your own workload.

thebluephantom