
I am trying to run a script using spark submit as this

spark-submit -v \
--master yarn \
--num-executors 80 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 5 \
--class cosineSimillarity jobs-1.0.jar

This script implements the DIMSUM algorithm on 60K records.

Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Unfortunately it is still running even after 3 hours. I tried with 1K records and it completed successfully within 2 minutes.

Can anyone recommend any changes to spark-submit params to make it faster?

T. Gawęda
MasterGoGo

1 Answer


Your spark-submit statement suggests that you have at least 80*5 = 400 cores, right?

This means you should ensure that you have at least 400 partitions, so that all your cores are kept busy (i.e. each core has at least 1 task to process).

Looking at the code you use, I think you should specify the number of partitions when reading the text file in sc.textFile(); AFAIK it defaults to 2 (see defaultMinPartitions in SparkContext.scala).
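As a sketch of what I mean (the file path and input format here are assumptions, not taken from your job), you can pass an explicit minimum partition count to sc.textFile(), or repartition an RDD you already have:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: path and space-separated format are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("CosineSimilarity"))

// Ask for at least one partition per core (80 executors * 5 cores = 400),
// instead of relying on defaultMinPartitions:
val rows = sc.textFile("hdfs:///path/to/input.txt", minPartitions = 400)
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Alternatively, rebalance an already-loaded RDD:
val balanced = rows.repartition(400)
```

Note that minPartitions is only a lower-bound hint for the read, while repartition(400) forces a shuffle to exactly 400 partitions.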

Raphael Roth