
I am submitting a Spark job with the following specification (the same program has been used to run on data sizes ranging from 50GB to 400GB):

/usr/hdp/2.6.0.3-8/spark2/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --driver-memory 5G \
 --executor-memory 10G \
 --num-executors 60 \
 --conf spark.yarn.executor.memoryOverhead=4096 \
 --conf spark.shuffle.registration.timeout=1500 \
 --executor-cores 3 \
 --class classname /home//target/scala-2.11/test_2.11-0.13.5.jar
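For context, each YARN container for this job requests roughly the executor memory plus the configured overhead. A quick sketch of that arithmetic, using only the settings above (the totals are estimates of requested memory, not measured usage):

```scala
// Per-container memory demand implied by the submit settings above.
val executorMemoryGb = 10          // --executor-memory 10G
val overheadGb       = 4096 / 1024 // spark.yarn.executor.memoryOverhead=4096 (MB)
val numExecutors     = 60          // --num-executors 60

val perContainerGb = executorMemoryGb + overheadGb // 14 GB per YARN container
val totalGb        = perContainerGb * numExecutors // 840 GB requested cluster-wide
```

If YARN's maximum container size or the cluster's total memory is below these figures, containers can be killed or never scheduled, which surfaces as lost shuffle outputs.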

I have tried repartitioning the data while reading, and also applied a repartition before doing any count-by-key operation on the RDD:

val rdd1 = rdd.map(x=>(x._2._2,x._2._1)).distinct.repartition(300)
val receiver_count=rdd1.map(x=>x._2).distinct.count
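One mitigation mentioned in the comments below is to persist the repartitioned RDD so that its shuffle output survives executor loss instead of being recomputed. A minimal sketch, assuming `rdd` is the same pair RDD as above:

```scala
import org.apache.spark.storage.StorageLevel

// Persist the shuffled RDD to memory and disk so that a lost executor
// does not force the expensive distinct/repartition to be recomputed.
val rdd1 = rdd.map(x => (x._2._2, x._2._1))
  .distinct
  .repartition(300)
  .persist(StorageLevel.MEMORY_AND_DISK)

val receiver_count = rdd1.map(_._2).distinct.count

rdd1.unpersist() // release the cached blocks when finished
```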

The user class threw this exception:

org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 9

  • Possible duplicate of [Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?](https://stackoverflow.com/questions/28901123/why-do-spark-jobs-fail-with-org-apache-spark-shuffle-metadatafetchfailedexceptio). Maybe some of the answers here can help you. – Shaido Jul 04 '19 at 03:41
  • As per the above-listed post: I have tried repartitioning before the reduce operation, persisting (using MEMORY_AND_DISK), and setting spark.executor.overhead.memory to 2048MB, but nothing worked out for me, hence I posted a new question to get more help. – manohar Jul 04 '19 at 14:35
  • As per the accepted answer there, did you make sure you are not giving the executors too much memory? Also see the answer about heartbeats and check so it doesn't apply to you. – Shaido Jul 05 '19 at 01:16

1 Answer


In my case, I gave my executors a little more memory and the job went through fine. You should definitely look at which stage your job is failing at, and from that determine whether increasing or decreasing the executors' memory would help.
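As a sketch of that adjustment (the exact values here are illustrative, not prescriptive), you could raise `--executor-memory` while making sure the total container size, memory plus overhead, still fits within YARN's per-container limit:

```
--executor-memory 12G
--conf spark.yarn.executor.memoryOverhead=4096
```

Fewer, larger executors can also reduce shuffle pressure, so adjusting `--num-executors` downward at the same time is worth trying.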

– rishab137