
Azure Synapse provides managed Spark pools to which Spark jobs can be submitted.

  1. How do I submit a Spark job (as JARs) along with its dependencies to the pool using Java?
  2. If multiple jobs are submitted (each with its own set of dependencies), are the dependencies shared across the jobs, or are the jobs agnostic of each other?

1 Answer


For (1):

Add the following dependencies (the azure-identity version can be supplied explicitly or managed via the Azure SDK BOM):

    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-analytics-synapse-spark</artifactId>
        <version>1.0.0-beta.4</version>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-identity</artifactId>
    </dependency>

With the sample code below:

    import com.azure.analytics.synapse.spark.SparkBatchClient;
    import com.azure.analytics.synapse.spark.SparkClientBuilder;
    import com.azure.analytics.synapse.spark.models.SparkBatchJob;
    import com.azure.analytics.synapse.spark.models.SparkBatchJobOptions;
    import com.azure.identity.DefaultAzureCredentialBuilder;

    import java.util.*;

    public class SynapseService {
        private final SparkBatchClient batchClient;

        public SynapseService() {
            // DefaultAzureCredential resolves credentials from the environment
            // (environment variables, managed identity, Azure CLI, etc.).
            batchClient = new SparkClientBuilder()
                    .endpoint("https://xxxx.dev.azuresynapse.net/")
                    .sparkPoolName("TestPool")
                    .credential(new DefaultAzureCredentialBuilder().build())
                    .buildSparkBatchClient();
        }

        public SparkBatchJob submitSparkJob(String name, String mainFile, String mainClass, List<String> arguments, List<String> jars) {
            SparkBatchJobOptions options = new SparkBatchJobOptions()
                    .setName(name)
                    .setFile(mainFile)
                    .setClassName(mainClass)
                    .setArguments(arguments)
                    .setJars(jars)
                    .setExecutorCount(3)   // example sizing; tune for your pool
                    .setExecutorCores(4)
                    .setDriverCores(4)
                    .setDriverMemory("6G")
                    .setExecutorMemory("6G");
            return batchClient.createSparkBatchJob(options);
        }

        /**
         * All possible Livy states: https://learn.microsoft.com/en-us/rest/api/synapse/data-plane/spark-batch/get-spark-batch-jobs#livystates
         *
         * Some of the values: busy, dead, error, idle, killed, not_started, recovering, running, shutting_down, starting, success
         *
         * @param id       id of the Synapse batch job
         * @param detailed whether to return detailed job information
         * @return the batch job, including its current Livy state
         */
        public SparkBatchJob getSparkJob(int id, boolean detailed) {
            return batchClient.getSparkBatchJob(id, detailed);
        }

        /**
         * Cancels the ongoing Synapse Spark job.
         *
         * @param jobId id of the Synapse job
         */
        public void cancelSparkJob(int jobId) {
            batchClient.cancelSparkBatchJob(jobId);
        }

    }

And finally, submit the Spark job:

    SynapseService synapse = new SynapseService();
    synapse.submitSparkJob("TestJob",
            "abfss://builds@xxxx.dfs.core.windows.net/core/jars/main-module_2.12-1.0.jar",
            "com.xx.Main",
            Collections.emptyList(),
            Arrays.asList("abfss://builds@xxxx.dfs.core.windows.net/core/jars/*"));
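
If the caller needs to wait for the job to finish, the Livy states listed above can be polled. Below is a minimal sketch, assuming `SparkBatchJob.getState()` exposes the Livy state; the terminal-state set and the 30-second interval are illustrative choices, not part of the SDK:

    import com.azure.analytics.synapse.spark.models.SparkBatchJob;

    import java.util.Set;

    public class SparkJobWaiter {

        // Terminal Livy states, taken from the state list documented above.
        private static final Set<String> TERMINAL_STATES =
                Set.of("dead", "error", "killed", "success");

        /**
         * Polls the job until it reaches a terminal Livy state and returns the last snapshot.
         */
        public static SparkBatchJob waitForCompletion(SynapseService synapse, int jobId)
                throws InterruptedException {
            while (true) {
                SparkBatchJob job = synapse.getSparkJob(jobId, true);
                if (TERMINAL_STATES.contains(String.valueOf(job.getState()))) {
                    return job;
                }
                Thread.sleep(30_000); // arbitrary poll interval
            }
        }
    }

The job id can be read from the `SparkBatchJob` returned by `submitSparkJob` (via `getId()`).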

Finally, you will need to grant the necessary role:

  1. Open Synapse Analytics Studio
  2. Manage -> Access Control
  3. Grant the Synapse Compute Operator role to the caller

To answer question 2:

When jobs are submitted to Synapse as JARs, each submission is equivalent to a spark-submit. So the jobs are agnostic of each other and do not share each other's dependencies.
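
As an illustration, here is a sketch of two independent submissions (paths and class names are placeholders), each carrying its own dependency list; neither job sees the other's JARs:

    SynapseService synapse = new SynapseService();

    // Each submission ships its own jar list; the two jobs share nothing.
    synapse.submitSparkJob("JobA",
            "abfss://builds@xxxx.dfs.core.windows.net/jobA/main.jar",
            "com.xx.JobA",
            Collections.emptyList(),
            Arrays.asList("abfss://builds@xxxx.dfs.core.windows.net/jobA/libs/*"));

    synapse.submitSparkJob("JobB",
            "abfss://builds@xxxx.dfs.core.windows.net/jobB/main.jar",
            "com.xx.JobB",
            Collections.emptyList(),
            Arrays.asList("abfss://builds@xxxx.dfs.core.windows.net/jobB/libs/*"));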
