Spark exception after re-submitting stopped application

Question

I'm running a Spark job (from a Spark notebook) using dynamic allocation with the following options:

"spark.master": "yarn-client",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.executorIdleTimeout": "30s",
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "1h",
"spark.dynamicAllocation.minExecutors": "0",
"spark.dynamicAllocation.maxExecutors": "20",
"spark.executor.cores": 2

(Note: I'm not sure yet whether the issue is caused by dynamicAllocation or not)

I'm using Spark version 1.6.1.

If I cancel a running job/app (either by pressing the cancel button on the cell in the notebook, or by shutting down the notebook server and thus the app) and restart the same app shortly (a few minutes) afterwards, I often get the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 38, i89810.sbb.ch): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
         at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
         at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
         at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
         at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
         at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
         at org.apache.spark.scheduler.Task.run(Task.scala:89)
         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
         at scala.Option.getOrElse(Option.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
         at scala.collection.immutable.List.foreach(List.scala:318)
         at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
         ... 11 more

Using the YARN ResourceManager, I verified that the old job was no longer running before re-submitting the job. Still, I suspect that the problem arises because the killed job has not been fully cleaned up yet and interferes with the newly launched one.

Has anyone encountered the same issue and found a way to solve it?

Do you use broadcast variables in the code? Did you enable `YarnShuffleService` in `yarn-site.xml`? See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/spark-yarn-YarnShuffleService.html — Jacek Laskowski, Aug 31 '16 at 21:16
@JacekLaskowski No, I do not broadcast any variables in my code. And no, I don't have YarnShuffleService configured in yarn-site.xml, so I suppose it's disabled? — Raphael Roth, Sep 01 '16 at 06:09
If you have not configured YarnShuffleService, my guess is that your dynamic allocation may or may not work. — Jacek Laskowski, Sep 01 '16 at 08:04

score 0 · Answer 1 · answered Oct 25 '16 at 20:03

You need to set up the external shuffle service when dynamic allocation is enabled. Otherwise, shuffle files are deleted when executors are removed, which is why the `Failed to get broadcast_3_piece0 of broadcast_3` exception is thrown.

For more information, see Dynamic Resource Allocation in the official Spark documentation.
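
As a rough sketch of what that setup usually involves on YARN (based on the "Running on YARN" page of the Spark documentation, not on this particular cluster): the Spark YARN shuffle jar has to be on the classpath of every NodeManager, and `yarn-site.xml` on each node needs entries along the following lines. The `mapreduce_shuffle` value is only an assumption about what is already configured; keep whatever is there and append `spark_shuffle`.

```xml
<!-- yarn-site.xml on every NodeManager (sketch; merge with your existing settings) -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <!-- mapreduce_shuffle is assumed to be present already; spark_shuffle is the addition -->
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```

The NodeManagers have to be restarted for the change to take effect. On the application side, `spark.shuffle.service.enabled` stays set to `true`, as in the question.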
