I'm running Spark 1.5.1 in standalone (client) mode using PySpark. I'm trying to start a job that seems to be memory-heavy (on the Python side, that is, so that memory should not be covered by the executor-memory setting). I'm testing on a machine with 96 cores and 128 GB of RAM.
I have a master and worker running, started using the start-all.sh script in /sbin.
These are the config files I use in /conf.
spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir /home/kv/Spark/spark-1.5.1-bin-hadoop2.6/logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.enabled false
spark.deploy.defaultCores 40
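Since the heavy memory use is on the Python side, I have been wondering whether spark.python.worker.memory is the relevant knob rather than the executor memory. A minimal sketch of the line I might add here (this is an assumption on my part; I have not verified that it helps):

# assumption: memory per Python worker process before it starts spilling to disk
spark.python.worker.memory 4g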
spark-env.sh:
SPARK_MASTER_IP='5.153.14.30' # Will become deprecated
SPARK_MASTER_HOST='5.153.14.30'
SPARK_MASTER_PORT=7079
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_WEBUI_PORT=8081
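I have not set any worker-level limits in spark-env.sh. In case capping resources at the worker level is the right approach, this is a sketch of what I assume those lines would look like (values chosen to leave headroom for the driver and the Python processes; untested):

SPARK_WORKER_CORES=40     # cores this worker offers to applications (assumed value)
SPARK_WORKER_MEMORY=100g  # memory this worker offers to executors (assumed value)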
I'm starting my script using the following command:
export SPARK_MASTER=spark://5.153.14.30:7079 #"local[*]"
spark-submit \
--master ${SPARK_MASTER} \
--num-executors 1 \
--driver-memory 20g \
--executor-memory 30g \
--executor-cores 40 \
--py-files code.zip \
<script>
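For reference, this is the variant I considered in order to cap the total number of cores the application can take. I'm assuming --total-executor-cores is the standalone-mode way to do this, but I have not confirmed that it changes the behaviour described below:

spark-submit \
--master ${SPARK_MASTER} \
--total-executor-cores 40 \
--driver-memory 20g \
--executor-memory 30g \
--executor-cores 40 \
--py-files code.zip \
<script>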
Now, I'm noticing behaviour that I don't understand:
- When I start my application with the settings above, I expect there to be 1 executor. However, 2 executors are started, each with 30g of memory and 40 cores. Why does Spark do this? I'm trying to limit the number of cores so that each core has more memory available; how can I enforce this? Right now my application gets killed because it uses too much memory.
- When I increase executor-cores to over 40, my job does not get started because of insufficient resources. I expect that this is because of the spark.deploy.defaultCores 40 setting in my spark-defaults.conf. But isn't that setting just a fallback for when an application does not specify a maximum number of cores itself? I should be able to override it, right? (See the sketch after this list.)
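The per-application override I would expect to take precedence over spark.deploy.defaultCores is spark.cores.max, which as far as I understand is what --total-executor-cores sets. A sketch of the line I would add to spark-defaults.conf, assuming my reading of the documentation is correct:

spark.cores.max 40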
Extract from the error messages I get:
Lost task 1532.0 in stage 2.0 (TID 5252, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:203)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
... 15 more
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 111 in stage 2.0 failed 4 times, most recent failure: Lost task 111.3 in stage 2.0 (TID 5673, 5.153.14.30): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)