
I'm trying to configure Spark for an entire Azure Synapse pipeline. I found Spark session config magic command and How to set Spark / Pyspark custom configs in Synapse Workspace spark pool. The %%configure magic command works fine for a single notebook. Example:

Insert a cell with the content below at the beginning of the notebook:

%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "32g",
    "executorCores": 4,
    "numExecutors" : 5
}
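
As an aside, the same %%configure payload also accepts a conf object for arbitrary Spark properties alongside the resource keys above (the payload follows the Livy session-request body; the spark.sql.shuffle.partitions entry below is only an illustrative assumption):

%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "32g",
    "executorCores": 4,
    "numExecutors": 5,
    "conf": {
        "spark.sql.shuffle.partitions": "200"
    }
}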

Then the following prints the expected values:

spark_executor_instances = spark.conf.get("spark.executor.instances")
print(f"spark.executor.instances {spark_executor_instances}")

spark_executor_memory = spark.conf.get("spark.executor.memory")
print(f"spark.executor.memory {spark_executor_memory}")

spark_driver_memory = spark.conf.get("spark.driver.memory")
print(f"spark.driver.memory {spark_driver_memory}")
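
With the %%configure cell above, the output should look roughly like this (the values simply mirror the requested configuration):

spark.executor.instances 5
spark.executor.memory 32g
spark.driver.memory 28g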

However, if I add that notebook as the first activity in an Azure Synapse pipeline, the Apache Spark application that executes that notebook gets the correct configuration, but the rest of the notebooks in the pipeline fall back to the default configuration.

How can I configure Spark for the entire pipeline? Should I copy and paste the above %%configure into each and every notebook in the pipeline, or is there a better way?

tchelidze
  • If you want your configuration to be the same for the entire pipeline, why don't you make it your default configuration, so you don't need that extra configuration cell? You should use `%%configure` only when you want it to be different for an edge or specific case. – Nikunj Kakadiya Dec 21 '21 at 13:23
  • @NikunjKakadiya thanks for the reply. Well, 1) uploading a config file to the Spark pool directly doesn't seem to work, because, as the article linked above says, Azure Synapse overrides some of those configs with default ones. 2) I want to have, say, one configuration for one pipeline and another configuration for another. Do you know how that can be achieved? – tchelidze Dec 21 '21 at 15:35

1 Answer


Yes, as far as I know this is the well-known option. You need to define %%configure -f at the beginning of each notebook in order to override the default settings for your job.

Alternatively, you can try navigating to the Spark pool in the Azure Portal and set the configurations on the Spark pool by uploading a text file that looks like this:

[Screenshots: the Spark configuration text file, and the Apache Spark configuration page of the Spark pool in the Azure Portal]
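
For illustration, here is a minimal sketch of what such a text file contains, one property per line in standard Spark properties format (the keys are standard Spark settings; the values are assumptions, not recommendations):

spark.driver.memory 28g
spark.driver.cores 4
spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 5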

Please refer to this third-party article for more details.

Moreover, it looks like one cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are nevertheless reserved.
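
To see what the session actually received, you can read the effective values back at runtime (a small sketch; spark.conf.get with a second argument is standard PySpark, and that argument is returned when the key is unset):

# Inspect the cores the running session actually got, falling back to
# "unset" when the property was never written into the session config
print(spark.conf.get("spark.executor.cores", "unset"))
print(spark.conf.get("spark.driver.cores", "unset"))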

Utkarsh Pal
  • yes, although `But in the Synapse spark pool, few of these user-defined configurations get overridden by the default value of the Spark pool.` – tchelidze Dec 23 '21 at 09:49
  • And because of this issue, you need to define `%%configure -f` in all notebooks. – Utkarsh Pal Dec 23 '21 at 09:58
  • Yep, although it looks like you cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are nevertheless reserved. – tchelidze Dec 23 '21 at 10:00
  • Ohh, thank you for adding this valuable point; I'll update the answer. If you find the given answer useful, please accept it (click the check mark on the left side of the answer) to help other community members. – Utkarsh Pal Dec 23 '21 at 10:03