I am computing the TF-IDF metric on a dataset that is initially 569 MB. Although I do get results in the end, I keep getting the error below:
WARN scheduler.TaskSetManager: Lost task 13.0 in stage 11.0 (TID 84, X.X.X.X, executor 0): FetchFailed(null, shuffleId=4, mapId=-1, reduceId=4, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 4
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:882)
at org.apache.spark.MapOutputTracker$$anonfun$convertMapStatuses$2.apply(MapOutputTracker.scala:878)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:878)
at org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:691)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have read related posts and have already changed some Spark properties, as below:
spark = SparkSession.builder \
    .appName("part_2_task_2") \
    .config('spark.executor.memory', '2g') \
    .config('spark.executor.memoryOverhead', '1g') \
    .config('spark.shuffle.io.maxRetries', 5) \
    .config('spark.shuffle.io.retryWait', '30s') \
    .config('spark.network.timeout', '200s') \
    .getOrCreate()
So currently my cluster has the following properties:
spark.executor.cores 2
spark.executor.instances 2
spark.executor.memory 2g
spark.executor.memoryOverhead 1g
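
For reference, this is how I verify which of these settings are actually in effect at runtime (a small check that assumes the same spark session object created above):

# Print the executor/shuffle-related settings actually in effect for this session,
# to confirm the values above were picked up and not silently ignored.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith('spark.executor') or key.startswith('spark.shuffle'):
        print(key, '=', value)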
Moreover, I looked into the Spark UI in more detail to see where the issue comes from, and found that the failed stage arises from line 126 of my code, which is the join below:
tfidf = tf.join(idf)
and the two RDDs tf and idf are calculated as:
tf = step1.map(lambda x: (x[0][0], (x[0][1], x[0][2], x[0][3], x[1]/x[0][3])))
idf = step1.map(lambda x: (x[0][0], (x[0][2], x[1], 1))) \
           .map(lambda x: (x[0], x[1][2])) \
           .reduceByKey(lambda x, y: x + y) \
           .map(lambda x: (x[0], (x[1], math.log10(number_of_docs / x[1]))))
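
To make the shapes concrete, here is a toy illustration of what I expect the join to produce (made-up words and values, keyed by the word as in my real RDDs):

# Toy illustration of the join semantics: tf has one record per (document, word),
# idf has one record per word, and both are keyed by the word.
sc = spark.sparkContext
tf_toy = sc.parallelize([('cat', ('doc1', 0.2)), ('cat', ('doc2', 0.5)), ('dog', ('doc1', 0.1))])
idf_toy = sc.parallelize([('cat', 0.3), ('dog', 1.0)])
# Each tf record is matched with the single idf record for the same word,
# so the result has one record per (document, word), like tf.
print(tf_toy.join(idf_toy).collect())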
The RDDs tf and idf have different .count() values, since tf has one record per (document, word) pair whereas idf has one record per word only, which is exactly why I am joining them. Could that mismatch be an issue, so that I should check or align their sizes with partitioning commands before joining, even though those operations are costly? If this is not the issue, what would be the ideal cluster properties for processing data of the size mentioned above?
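
This is the kind of pre-partitioning I was considering before the join (just a sketch; the partition count of 48 is an arbitrary guess, and I have not measured whether this actually reduces the shuffle in PySpark):

num_partitions = 48  # arbitrary guess, not tuned for my cluster

# Hash-partition both RDDs by their key (the word) with the same partitioner
# before joining, as an attempt to reduce the shuffle done by the join itself.
tf_part = tf.partitionBy(num_partitions).cache()
idf_part = idf.partitionBy(num_partitions).cache()
tfidf = tf_part.join(idf_part, num_partitions)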