
I am using a Spark 2.2.0 cluster configured in Standalone mode. The cluster has two octa-core machines, is used exclusively for Spark jobs, and no other process runs on it. I have around 8 Spark Streaming apps that run on this cluster.
I explicitly set SPARK_WORKER_CORES (in spark-env.sh) to 8 and allocate one core to each app using the total-executor-cores setting. This configuration limits the ability to work on multiple tasks in parallel: if a stage works on a partitioned RDD with 200 partitions, only one task executes at a time. What I wanted Spark to do was to start a separate thread for each job and process them in parallel, but I couldn't find a separate Spark setting to control the number of threads.
So I decided to experiment and inflated the number of cores (i.e. SPARK_WORKER_CORES in spark-env.sh) to 1000 on each machine, then gave 100 cores to each Spark application. I found that Spark started processing 100 partitions in parallel this time, indicating that 100 threads were being used.
I am not sure whether this is the correct way to control the number of threads used by a Spark job.
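For illustration, here is a minimal sketch of the per-app core request described above (the master URL and app name are made-up placeholders); the --total-executor-cores flag passed to spark-submit corresponds to the spark.cores.max property:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the setup described above; master URL and app name are hypothetical.
// spark-submit --total-executor-cores 1 is equivalent to spark.cores.max = 1.
val conf = new SparkConf()
  .setAppName("streaming-app-1")
  .setMaster("spark://master-host:7077")
  .set("spark.cores.max", "1") // one core for the whole application

val sc = new SparkContext(conf)

// A stage over 200 partitions still produces 200 tasks,
// but with a single core they execute one after another.
sc.parallelize(1 to 1000000, 200).map(_ * 2).count()
```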

scorpio

1 Answer


You mixed up two things:

  • Cluster manager properties - SPARK_WORKER_CORES - the total number of cores a worker can offer. Use it to control the fraction of resources that should be used by Spark in total.
  • Application properties - --total-executor-cores / spark.cores.max - the number of cores that an application requests from the cluster manager. Use it to control in-app parallelism.

Only the second one is directly responsible for app parallelism, as long as the first one is not a limiting factor.

Also, a CORE in Spark is a synonym for a thread. If you:

allocate one core to each app using the total-executor-cores setting.

then you specifically assign a single data processing thread.
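As a rough sketch of that point (the master URL and numbers below are assumptions for illustration, not part of the setup above): requesting N cores via spark.cores.max gives the application up to N concurrently running task threads, provided the workers offer enough cores in total.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch, assuming a standalone master at spark://master-host:7077.
// Requesting 8 cores means up to 8 tasks (threads) run at the same time in
// this application; same as spark-submit --total-executor-cores 8.
val conf = new SparkConf()
  .setAppName("streaming-app") // hypothetical name
  .setMaster("spark://master-host:7077")
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)

// A 200-partition stage still creates 200 tasks; they are now scheduled 8 at a time.
```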

Alper t. Turker
  • Yeah, I understand that those are two different things, but they work hand in hand, as you also indicated in your answer: total-executor-cores has to be smaller than SPARK_WORKER_CORES. But this line answers my question: "CORE in Spark is a synonym of thread" – scorpio Jan 29 '18 at 10:23