Spark starting more executors than specified

Question

I'm running Spark 1.5.1 in standalone (client) mode using Pyspark. I'm trying to start a job that seems to be memory heavy (in python that is, so that should not be part of the executor-memory setting). I'm testing on a machine with 96 cores and 128 GB of RAM.

I have a master and worker running, started using the start-all.sh script in /sbin.

These are the config files I use in /conf.

spark-defaults.conf:

spark.eventLog.enabled           true
spark.eventLog.dir               /home/kv/Spark/spark-1.5.1-bin-hadoop2.6/logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled  false
spark.deploy.
defaultCores       40

spark-env.sh:

PARK_MASTER_IP='5.153.14.30'  # Will become deprecated
SPARK_MASTER_HOST='5.153.14.30'
SPARK_MASTER_PORT=7079
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_WEBUI_PORT=8081

I'm starting my script using the following command:

export SPARK_MASTER=spark://5.153.14.30:7079  #"local[*]"
spark-submit \
    --master ${SPARK_MASTER} \
    --num-executors 1 \
    --driver-memory 20g \
    --executor-memory 30g \
    --executor-cores 40 \
    --py-files code.zip \
<script>

Now, I'm noticing behaviour that I don't understand:

When I start my application with the settings above, I expect there to be 1 executor. However, 2 executors are started, each having 30g of memory and 40 cores. Why does spark do this? I'm trying to limit the number of cores to have more memory per core, how can I enforce this? Now my application gets killed because it uses too much memory.
When I increase executor-cores to over 40, my job does not get started because of not enough resources. I expect that this is because of the defaultCores 40 setting in my spark-defaults. But is't this just as a backup for when my application does not provide a maximum number of cores? I should be able to overwrite that right?

Extract from the error messages I get:

Lost task 1532.0 in stage 2.0 (TID 5252, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
    ... 15 more

[...]

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 111 in stage 2.0 failed 4 times, most recent failure: Lost task 111.3 in stage 2.0 (TID 5673, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

score 1 · Answer 1 · answered Aug 29 '16 at 02:53

Check or set the value for spark.executor.instances. The default is 2, which may explain why you get 2 executors.

Since your server has 96 cores, and you set defaultcores to 40, you only have room for 2 executors since 2*40 = 80. The remaining 16 cores are insufficient for another executor and the driver also requires CPU cores.

gsamaras · Answer 2 · 2016-08-29T03:13:48.583

I expect there to be 1 executor. However, 2 executors are started

I think the one executor you see, it's actually the driver.

So one master, one slave (2 nodes in totals).

You can add to your script these configuration flags:

--conf spark.executor.cores=8        <-- will set it 8, you probably want less
--conf spark.driver.cores=8          <-- same, but for driver only

my job does not get started because of not enough resources.

I believe the container gets killed. You see, you ask for too many resources, so every container/task/core tries to take as much memory as possible, and your system can't simple give more.

The container might exceed its memory limits (you should be able to see more in the logs to be certain though).

Spark starting more executors than specified

2 Answers2