7

I'm running a Python script in PySpark and got the following error: NameError: name 'spark' is not defined

I looked it up and found that the reason may be that spark.dynamicAllocation.enabled is not enabled yet.

According to Spark's documentation (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-dynamic-allocation.html#spark_dynamicAllocation_enabled): spark.dynamicAllocation.enabled (default: false) controls whether dynamic allocation is enabled or not. It is assumed that spark.executor.instances is not set or is 0 (which is the default value).

Since the default setting is false, I need to change the Spark setting to enable spark.dynamicAllocation.enabled.

I installed Spark with brew and didn't change its configuration or settings.

How can I change the setting and enable spark.dynamicAllocation.enabled?

Thanks a lot.

Ram Ghadiyaram
mflowww
  • The above link is not the official Spark documentation; it's the Mastering Apache Spark book by Jacek, who is also a user of SO. Please change it appropriately :) – Ram Ghadiyaram Oct 25 '16 at 03:08

5 Answers

8

Question: How can I change the setting and enable spark.dynamicAllocation.enabled?

There are three options through which you can achieve this:

1) Modify the parameters mentioned below in spark-defaults.conf.
2) Pass the same parameters via --conf in your spark-submit command.
3) Specify the dynamic allocation config programmatically, as demonstrated below.

Of these, the programmatic way (in Scala syntax) looks like this:

import org.apache.spark.SparkConf

val conf = new SparkConf()
      .setMaster("ClusterManager")                        // placeholder: e.g. "yarn" or "spark://host:7077"
      .setAppName("test-executor-allocation-manager")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")   // set() takes string values
      .set("spark.dynamicAllocation.maxExecutors", "2")
      .set("spark.shuffle.service.enabled", "true")       // for standalone mode
Ram Ghadiyaram
  • Thanks a lot! Shall I include the snippet you provided here in the PySpark script I write? Or is this part of a configuration .sh file I should modify? – mflowww Oct 25 '16 at 03:46
  • You have to include this in your Python program file; moreover, the above is Scala syntax. – Ram Ghadiyaram Oct 25 '16 at 06:48
  • Thanks a lot. I'm writing a Python script to submit to PySpark. Let me try to adapt what you suggest here and see if it works. – mflowww Oct 25 '16 at 17:26
3

There are several places you can set it. If you would like to enable it on a per-job basis, set the following in each application:

conf.set("spark.dynamicAllocation.enabled","true")

If you want to set it for all jobs, navigate to the Spark conf directory. In the Hortonworks distro it should be

/usr/hdp/current/spark-client/conf/

Add the setting to the spark-defaults.conf there and you should be good to go.
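
If you are submitting a Python script with spark-submit rather than using the shell, the same per-job settings can also be passed on the command line (a sketch; my_script.py is a placeholder name):

spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  my_script.py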

Joe Widen
  • Thanks a lot! I would like to enable it on a per-job basis. Is conf.set("spark.dynamicAllocation.enabled","true") a command line that I shall type in the terminal? Which directory should I change to before I type it? Thanks a lot! – mflowww Oct 25 '16 at 03:43
  • If you're running from the command line using spark-shell, start the shell with this command: spark-shell --conf spark.dynamicAllocation.enabled=true. It won't matter what directory you are in when you start the shell. If you're writing an application, set it inside the application after you create the Spark config, using conf.set(). – Joe Widen Oct 25 '16 at 07:10
  • Thanks a lot. I see. If I'm writing a Python script and run it with spark-submit from the command line (not inside the pyspark shell), I shall just include this line of code in my Python script, correct? – mflowww Oct 25 '16 at 17:25
1

This is an issue that affects Spark installations made using other resources as well, such as the spark-ec2 script for installing on Amazon Web Services. From the Spark documentation, two values in SPARK_HOME/conf/spark-defaults.conf need to be set:

spark.shuffle.service.enabled   true
spark.dynamicAllocation.enabled true

see this: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation

If your installation has a spark-env.sh script in SPARK_HOME/conf, make sure that it does not have lines such as the following, or that they are commented out:

export SPARK_WORKER_INSTANCES=1   # or some other integer
export SPARK_EXECUTOR_INSTANCES=1 # or some other integer
Peter Pearman
0

Configuration parameters can be set in PySpark from a notebook using a command similar to the following:

spark.conf.set("spark.sql.crossJoin.enabled", "true")
Gaurav Kumar
0

In addition to the previous answers: all of the configs mentioned may fail to take effect because of interpreter settings (if you use Zeppelin). I use Livy, and its default settings override the dynamicAllocation parameters.