I'm using the Cloudera quickstart VM (CDH 5.10.1) with PySpark (1.6.0) and YARN (MR2 included) to aggregate numerical data per hour. I have 1 CPU with 4 cores and 32 GB of RAM.

I have a file named aggregate.py, but until today I had never submitted the job with spark-submit; I used the pyspark interactive shell and copy/pasted the code to test it. To start the pyspark interactive shell I used:

pyspark --master yarn-client

I followed the processing in the web UI accessible at quickstart.cloudera:8088/cluster and could see that YARN created 3 executors and 1 driver with one core each (not a good configuration, but the main purpose is to make a proof of concept, until we move to a real cluster).

When submitting the same code with spark-submit:

spark-submit --verbose \
    --master yarn \
    --deploy-mode client \
    --num-executors 2 \
    --driver-memory 3G \
    --executor-memory 6G \
    --executor-cores 2 \
    aggregate.py

I only get the driver, which also executes the tasks. Note that spark.dynamicAllocation.enabled is set to true in the Environment tab, and spark.dynamicAllocation.minExecutors is set to 2.

I tried using spark-submit aggregate.py alone, and I still got only the driver as executor. I can't manage to get more than 1 executor with spark-submit, yet it works in the interactive shell!
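
For reference, here is the variant I would use to rule out dynamic allocation interfering (a sketch only; spark.dynamicAllocation.enabled is a standard Spark property, the other values are the ones from above):

spark-submit --verbose \
    --master yarn \
    --deploy-mode client \
    --conf spark.dynamicAllocation.enabled=false \
    --num-executors 2 \
    --driver-memory 3G \
    --executor-memory 6G \
    --executor-cores 2 \
    aggregate.py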

My YARN configuration is as follows:

yarn.nodemanager.resource.memory-mb = 17 GiB

yarn.nodemanager.resource.cpu-vcores = 4

yarn.scheduler.minimum-allocation-mb = 3 GiB

yarn.scheduler.maximum-allocation-mb = 16 GiB

yarn.scheduler.minimum-allocation-vcores = 1

yarn.scheduler.maximum-allocation-vcores = 2
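
For what it's worth, the cluster-side numbers can also be cross-checked with the stock YARN CLI (a sketch; the exact output format varies by Hadoop/CDH version):

yarn node -list
yarn application -list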

If someone can explain to me what I'm doing wrong, it would be a great help!

bobolafrite

2 Answers


You have to set the driver memory and executor memory in spark-defaults.conf. It's located at

$SPARK_HOME/conf/spark-defaults.conf

and if there is only a template file like

spark-defaults.conf.template

then you have to rename it to

spark-defaults.conf

and then set the number of executors, the executor memory, and the number of executor cores. You can get an example from the template file or check this link:

https://spark.apache.org/docs/latest/configuration.html.
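
For example, a minimal spark-defaults.conf could look like this (the property names are standard; the values are placeholders, not recommendations for your VM):

spark.master               yarn
spark.driver.memory        2g
spark.executor.instances   2
spark.executor.memory      2g
spark.executor.cores       1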

or

When you use pyspark, it uses the default executor memory, but here in spark-submit you set executor-memory = 6G. I think you have to reduce the memory or remove this field so it can use the default memory.
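
For example, dropping the explicit memory flags so the defaults apply (just a sketch):

spark-submit --master yarn \
    --deploy-mode client \
    --num-executors 2 \
    --executor-cores 2 \
    aggregate.py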

Sahil Desai
  • `spark-submit --verbose --master yarn --deploy-mode client --num-executors 2 --executor-cores 2 aggregate.py` – Sahil Desai Oct 27 '17 at 04:59

Just a guess: as you said earlier, "Yarn created 3 executors and 1 driver with one core each", so you have 4 cores in total.

Now as per your spark-submit statement,

cores requested = num-executors (2) * executor-cores (2) + 1 for the driver = 5
# but in total you only have 4 cores, so YARN is unable to give you the executors
# (after the driver, only 3 cores are left).
# Check if this is the issue.
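
For example, a submit line that fits within the 4 available cores (2 executors * 1 core + 1 for the driver = 3; just a sketch):

spark-submit --master yarn \
    --deploy-mode client \
    --num-executors 2 \
    --executor-cores 1 \
    aggregate.py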
Satya
  • I agree with you, but as I said, I also used `spark-submit aggregate.py` without any other argument, so it should be able to create at least 2 or 3 executors with 1 core (since `yarn.scheduler.minimum-allocation-vcores` = 1). Am I wrong? – bobolafrite Oct 27 '17 at 07:31
  • Though I never came across such a situation, my suggestion would be: try spark-submit without "--executor-cores 2" and check if it works; let Spark/YARN create 2 executors with the available cores... (I am not sure on this either.) – Satya Oct 27 '17 at 08:01
  • It doesn't change anything. Today we got the real cluster; I'll try and see if this error persists – bobolafrite Oct 30 '17 at 08:08