I'm using .NET Spark in a Synapse notebook to transform data. The source data consists of multiple parquet files within subfolders in a directory.

/mnt/source/f1/*.parquet
/mnt/source/f2/*.parquet
/mnt/source/f3/*.parquet
/mnt/source/.../*.parquet

Originally, I tried loading all parquet files in one go and transforming them:

var df = spark.Read().Parquet("/source/*/*.parquet");

With this approach, the operation would eventually time out without ever producing any output (we're talking about roughly 50 million source records).

I can successfully transform individual folders though:

var df = spark.Read().Parquet("/source/f1/*.parquet");

This made me wonder if I should slice my data and iterate over the subfolders instead.

Using a simple foreach, the job finishes in 50 minutes!
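
For clarity, the loop I'm using looks roughly like this. It is only a sketch: the hard-coded folder list and the Transform() helper stand in for my actual enumeration and join logic, and /target/output is a placeholder destination.

using Microsoft.Spark.Sql;

// `spark` is the SparkSession that the Synapse notebook predefines.
var folders = new[] { "/source/f1", "/source/f2", "/source/f3" };

foreach (var folder in folders)
{
    // Read one subfolder at a time and transform it in isolation.
    DataFrame df = spark.Read().Parquet($"{folder}/*.parquet");

    // Placeholder for the actual join-heavy transformation.
    DataFrame transformed = Transform(df);

    // Append each slice to the same output location.
    transformed.Write().Mode("append").Parquet("/target/output");
}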

What is the correct approach to solve this? Should I use a foreach, a Parallel.ForEach() or something entirely different (Spark-specific?) to transform all input folders?
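
For reference, the Parallel.ForEach() variant I have in mind would look roughly like this (same placeholders as above). Each driver thread submits its own Spark job, and each slice gets its own output folder so no two writers touch the same path:

using System.Threading.Tasks;
using Microsoft.Spark.Sql;

var folders = new[] { "/source/f1", "/source/f2", "/source/f3" };

Parallel.ForEach(folders, folder =>
{
    // Each iteration reads and transforms one subfolder; Spark schedules
    // the resulting jobs concurrently on the pool.
    DataFrame df = spark.Read().Parquet($"{folder}/*.parquet");

    // Write to a per-folder destination to avoid concurrent writers
    // racing on a single output path.
    var name = System.IO.Path.GetFileName(folder); // e.g. "f1"
    Transform(df).Write().Mode("overwrite").Parquet($"/target/{name}");
});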

Krumelur
  • Try using a single `*` for the folder. – Pratik Lad Jan 16 '23 at 10:10
  • @PratikLad - I just edited my post. I noticed I wasn't clear about what the issue is: using the wildcard path would time out. – Krumelur Jan 16 '23 at 10:50
  • If you just do a wildcard read and a count (no transformations) is it still timing out? – ScootCork Jan 16 '23 at 11:27
  • @ScootCork Getting counts returns after about 10 minutes with roughly 1 million rows. Note that each row has nested array data, contributing to the total of 50 million. The transformation is running joins on the nested data. (see my edits too) – Krumelur Jan 16 '23 at 11:43
  • It's hard to say why it is timing out; it might be that your Spark pool is not properly sized for the transformations you're doing, or you might have a large number of small parquet files causing long read times. You'd have to dive into what is happening (query explain plan, Spark pool logs, etc.). – ScootCork Jan 17 '23 at 10:05

0 Answers