Cannot get faster results via yarn when running spark in a hadoop cluster

Question

Using an LSH algorithm in Spark 1.4 (https://github.com/soundcloud/cosine-lsh-join-spark/tree/master/src/main/scala/com/soundcloud/lsh), I process a 4 GB text file in LIBSVM format (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) to find duplicates. First, I ran my Scala script on a single server, using only one executor with 36 cores. I got my results in 1.5 hours.

To get my results much faster, I tried running my code on a Hadoop cluster via YARN, on an HPC system with 3 nodes where each node has 20 cores and 64 GB of memory. Since I do not have much experience running code on HPC systems, I followed the suggestions given here: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

As a result, I submitted the Spark job as below:

spark-submit --class com.soundcloud.lsh.MainCerebro --master yarn-cluster --num-executors 11 --executor-memory 19G --executor-cores 5 --driver-memory 2g cosine-lsh_yarn.jar

As I understand it, I have assigned 3 executors per node and 19 GB to each executor.
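A rough sketch of the sizing arithmetic from that post, applied to this cluster. `ExecutorSizing` and `ExecutorPlan` are hypothetical names, not Spark APIs, and the reserved core, the reserved 1 GB and the 7% overhead are the post's figures taken as assumptions; the real limits depend on how YARN is configured on the nodes.

```scala
// Hypothetical helper (not a Spark API): executor sizing heuristic from the
// Cloudera tuning post, under the assumptions stated in the comments below.
final case class ExecutorPlan(executorsPerNode: Int, numExecutors: Int, executorMemoryGb: Int)

object ExecutorSizing {
  def plan(nodes: Int,
           coresPerNode: Int,
           memPerNodeGb: Int,
           coresPerExecutor: Int,
           overheadFraction: Double = 0.07): ExecutorPlan = {
    // Assumption: one core and 1 GB per node are left to the OS and Hadoop daemons.
    val usableCores = coresPerNode - 1
    val usableMemGb = memPerNodeGb - 1

    // Integer division: how many executors of this width fit on one node.
    val perNode = usableCores / coresPerExecutor

    // Assumption: one container slot in the cluster is taken by the application master.
    val total = nodes * perNode - 1

    // Each container's share of node memory, minus the off-heap overhead, rounded down.
    val containerGb = usableMemGb.toDouble / perNode
    val heapGb = math.floor(containerGb * (1.0 - overheadFraction)).toInt

    ExecutorPlan(perNode, total, heapGb)
  }

  def main(args: Array[String]): Unit = {
    // The cluster from the question: 3 nodes, 20 cores and 64 GB each, 5 cores per executor.
    println(plan(nodes = 3, coresPerNode = 20, memPerNodeGb = 64, coresPerExecutor = 5))
  }
}
```

Under these assumptions, 3 nodes with 20 cores and 64 GB each give 3 executors per node and 8 executors in total at 19 GB each, so a request for 11 executors would not fit.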

However, I still had not received my results after more than 2 hours.

My spark configuration is:

val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("yarn-cluster")
      .set("spark.driver.maxResultSize", "0");

How can I dig into this issue? Where should I start in order to improve the computation time?

EDIT:

1)

I have noticed that coalesce is much slower on YARN:

  entries.coalesce(1, true).saveAsTextFile(text_string)

2)

EXECUTORS AND STAGES FROM HPC:

EXECUTORS AND STAGES FROM SERVER:

My first hunch is that the YARN cluster doesn't provide more parallelism (40 total cores vs. 36 cores) but does introduce network overhead. Without more info, it's impossible to find the cause. You can use the Spark UI to compare the job times and see which one is slower. — zsxwing, Dec 20 '16 at 01:41
@zsxwing I have added some screenshots from the Spark UI. As you can see, the stages take a bit longer on the YARN cluster, especially during the sorting steps. Do these results tell us anything important? — mlee_jordan, Dec 20 '16 at 15:27
My hunch is that sending shuffle data over the network makes the job slow on YARN. — zsxwing, Dec 20 '16 at 22:33

score 0 · Answer 1 · answered Jul 19 '17 at 19:03

Too much memory is tied up as storage memory (the memory used for caching data), and you are not using it efficiently: in total, less than 10 GB of the 40 GB is used. You could reduce the storage memory and use it as execution memory instead.
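To see where roughly 40 GB of storage memory might come from, here is a minimal sketch assuming the static memory model of Spark 1.x (before 1.6). As far as I know, in that model the storage region is heap × `spark.storage.memoryFraction` × `spark.storage.safetyFraction` (defaults 0.6 and 0.9) and the shuffle region is heap × `spark.shuffle.memoryFraction` × `spark.shuffle.safetyFraction` (defaults 0.2 and 0.8). `LegacyMemorySplit` is a hypothetical helper, not a Spark API.

```scala
// Hypothetical helper (not a Spark API): approximate split of an executor heap
// under the pre-1.6 static memory model. The default fractions are assumptions
// based on the Spark 1.x configuration defaults.
object LegacyMemorySplit {
  // Memory available for cached blocks.
  def storageGb(heapGb: Double, fraction: Double = 0.6, safety: Double = 0.9): Double =
    heapGb * fraction * safety

  // Memory available for shuffle, sort and aggregation buffers.
  def shuffleGb(heapGb: Double, fraction: Double = 0.2, safety: Double = 0.8): Double =
    heapGb * fraction * safety

  def main(args: Array[String]): Unit = {
    val heapGb = 19.0 // --executor-memory 19G from the question

    println(f"defaults: storage ${storageGb(heapGb)}%.2f GB, shuffle ${shuffleGb(heapGb)}%.2f GB")

    // Illustrative, arbitrary values that shift memory from storage to shuffle.
    val storage = storageGb(heapGb, fraction = 0.3)
    val shuffle = shuffleGb(heapGb, fraction = 0.5)
    println(f"shifted:  storage $storage%.2f GB, shuffle $shuffle%.2f GB")
  }
}
```

With a 19 GB heap this gives about 10.26 GB of storage memory per executor, or about 41 GB over 4 executors, which roughly matches the 40 GB figure. Lowering the storage fraction and raising the shuffle fraction moves memory toward execution; the 0.3 and 0.5 values are arbitrary illustrations, not recommendations.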

Even though you specified 11 executors, only 4 executors were started (inferred from the first Spark UI screenshot). The total number of cores used by Spark is only 19 across all executors, and the total number of cores equals the number of tasks that can run at the same time.
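To make the effect of the missing executors concrete, here is a small sketch. `TaskWaves` is a hypothetical helper, and the 200-task stage is an invented figure for illustration, not a number from the job. The slot counts are the ones discussed here: 19 cores observed on YARN, 11 × 5 = 55 requested, and 36 on the single server.

```scala
// Hypothetical helper (not a Spark API): how many sequential "waves" of tasks
// a stage needs when only a fixed number of task slots (cores) is available.
object TaskWaves {
  // Total task slots requested via --num-executors and --executor-cores.
  def slots(executors: Int, coresPerExecutor: Int): Int =
    executors * coresPerExecutor

  // Ceiling division: each wave runs at most `slots` tasks in parallel.
  def waves(tasks: Int, slots: Int): Int = {
    require(tasks >= 0 && slots > 0, "tasks must be non-negative and slots positive")
    (tasks + slots - 1) / slots
  }

  def main(args: Array[String]): Unit = {
    val tasks = 200 // invented stage size, for illustration only

    val requested = slots(executors = 11, coresPerExecutor = 5)
    println(s"requested slots on YARN: $requested, waves: ${waves(tasks, requested)}")
    println(s"observed slots on YARN:  19, waves: ${waves(tasks, 19)}")
    println(s"single server slots:     36, waves: ${waves(tasks, 36)}")
  }
}
```

With only 19 slots the same stage needs more waves than on the 36-core server, so the YARN run can be slower even before any network shuffle cost is counted.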

Please go through the following link.

https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html
