
I'm running JupyterHub with the pyspark3 kernel on an AWS EMR cluster. On EMR, the JupyterHub pyspark3 kernel uses a Livy session to run workloads on the YARN scheduler. My question is about the Spark configuration: executor memory and cores, driver memory and cores, and so on.

A default configuration already exists in Jupyter's config.json file:

...

"session_configs":{
      "executorMemory":"4096M",
      "executorCores":2,
      "driverCores":2,
      "driverMemory":"4096M",
      "numExecutors":2
   },

...

We can override this configuration using sparkmagic:

%%configure -f
{"conf":{"spark.pyspark.python": "python3",
         "spark.pyspark.virtualenv.enabled": "true",
         "spark.pyspark.virtualenv.type":"native", 
         "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv", 
         "spark.executor.memory":"2g",
         "spark.driver.memory": "2g",
         "spark.executor.cores": "1",
         "spark.num.executors": "1",
         "spark.driver.maxResultSize": "2g", 
         "spark.yarn.executor.memoryOverhead": "2g",
         "spark.yarn.driver.memoryOverhead": "2g",
         "spark.yarn.queue": "default"
    }
}

There is also a configuration in the spark-defaults.conf file on the master node of the EMR cluster:

spark.executor.memory            2048M
spark.driver.memory              2048M
spark.yarn.driver.memoryOverhead 409M
spark.executor.cores             2
...

Which configuration is used when I initiate a SparkSession to run a Spark application on the YARN cluster?

Below is a screenshot of a running Spark application on the YARN scheduler:

[Screenshot: Spark application on YARN]

nael.fridhi

1 Answer


In my experience, and according to this link on how to modify the Spark configuration, the settings you pass through %%configure -f are the ones that get used, provided that you run it as the first command so that the session starts with this configuration.
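One way to check this on your own cluster, rather than rely on my experience alone, is to read the configuration back from the running session. The following is only a minimal sketch: the helper name `effective_resources` and the list of keys are my own choices for illustration, not part of sparkmagic or Livy. It assumes the `spark` session object that the pyspark3 kernel provides, and uses `SparkConf.getAll()`, which returns a list of (key, value) pairs.

```python
# Keys worth comparing against config.json, %%configure and spark-defaults.conf.
# This list is an illustrative choice; extend it with whatever you set.
RESOURCE_KEYS = (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.driver.memory",
    "spark.driver.cores",
)


def effective_resources(conf_pairs, keys=RESOURCE_KEYS):
    """Pick the resource-related settings out of a list of (key, value) pairs.

    A key that is missing from conf_pairs maps to None, meaning it does not
    appear among the pairs that were passed in.
    """
    found = dict(conf_pairs)
    return {key: found.get(key) for key in keys}


# In a notebook cell, after the session has started (needs a live session,
# so it is shown here as a comment):
#   pairs = spark.sparkContext.getConf().getAll()
#   for key, value in effective_resources(pairs).items():
#       print(key, "=", value)
```

If the values printed match what you passed in %%configure -f, that confirms which source was applied for your session. You can also compare them with the resources shown for the application in the YARN scheduler UI.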

A.B