I'm running a data transformation in Synapse and would like to speed it up.
My pool is configured as "4 vCores, 28GB of memory with dynamic executors from 1..7".
My data in ADL Gen2 consists of roughly 300 directories. Every directory holds between 100 and 70,000 JSON files.
These JSON files need to be converted into parquet format, then the parquet is transformed (splitting out a nested array), and the results are again stored in ADL using parquet.
I created a notebook that accepts a single directory as input, creates the intermediary parquet data and transforms that data into the final structure. For the largest directory (70k files) this takes about 10 minutes; most directories complete within seconds to a few minutes.
Another notebook gets all directories and executes the notebook responsible for processing a single directory:
from notebookutils import mssparkutils

for folder in folders:
    print('Processing folder ' + folder + ' (' + str(folders.index(folder) + 1) + ' of ' + str(len(folders)) + ')')
    mssparkutils.notebook.run("RawJson/CopyToDataLake_SingleFolder", 1800, {
        "folderKey": folder,
        "outputSubfolder": outputSubfolder
    })
Almost all of the time is spent converting the thousands of JSON files into parquet (see below, an excerpt from the notebook called in the loop above); the final transformation (flattening the nested arrays) is almost negligible:
from notebookutils import mssparkutils

jsonDataInputPath = '{}/{}'.format(dataLakeRoot, folderKey)
repartitionedDataOutputPath = '{}/repartitioned/{}'.format(dataLakeRoot, folderKey)

# Read JSON.
input_df = spark \
    .read \
    .schema(jsonSchema) \
    .json('{}/*/*.json'.format(jsonDataInputPath), multiLine=True)

# Reduce to a max of 16 partitions.
partitioned_df = input_df.coalesce(16)

# Store as parquet.
partitioned_df \
    .write \
    .mode("overwrite") \
    .parquet(repartitionedDataOutputPath)
The job runs for hours and I would like to speed it up. When I check the session details while it is running, it always shows 0% utilization, and although the pool is configured for dynamic executors, only 2 are in use.
Can I run multiple notebook executions in parallel?
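For example, would something along these lines be a reasonable approach? This is just a sketch: the worker count is an arbitrary guess, and I'm not sure whether mssparkutils.notebook.run is safe to call from multiple threads.

from concurrent.futures import ThreadPoolExecutor
from notebookutils import mssparkutils

def process_folder(folder):
    # Same call as in the sequential loop above.
    return mssparkutils.notebook.run("RawJson/CopyToDataLake_SingleFolder", 1800, {
        "folderKey": folder,
        "outputSubfolder": outputSubfolder
    })

# Process several folders concurrently; 4 workers is an arbitrary choice.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_folder, folders))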