
I am trying to run a script using spark submit as this

spark-submit -v \
--master yarn \
--num-executors 80 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 5 \
--class cosineSimillarity jobs-1.0.jar

This script implements the DIMSUM algorithm on 60K records.

Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Unfortunately it is still running even after 3 hours. I tried with 1K records and it completed successfully within 2 minutes.

Can anyone recommend any changes to spark-submit params to make it faster?

T. Gawęda
MasterGoGo

1 Answer


Your spark-submit statement suggests that you have at least 80*5 = 400 cores, right?

This means you should ensure that you have at least 400 partitions, so that all your cores are kept busy (i.e. each core has at least 1 task to process).

Looking at the code you use, I think you should specify the number of partitions when reading the text file in sc.textFile(); AFAIK it defaults to 2 (see defaultMinPartitions in SparkContext.scala).
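As a sketch of what I mean (the file path and input format here are assumptions, not taken from your job), you can pass an explicit minimum partition count to sc.textFile(), or repartition an RDD you already have:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: path and space-separated format are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("CosineSimilarity"))

// Ask for at least one partition per core (80 executors * 5 cores = 400),
// instead of relying on defaultMinPartitions:
val rows = sc.textFile("hdfs:///path/to/input.txt", minPartitions = 400)
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Alternatively, rebalance an already-loaded RDD:
val balanced = rows.repartition(400)
```

Note that minPartitions is only a lower-bound hint for the read, while repartition(400) forces a shuffle to exactly 400 partitions.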

Raphael Roth