I'm using .NET for Apache Spark in a Synapse notebook to transform data. The source data consists of multiple parquet files in subfolders of a single directory:
/mnt/source/f1/*.parquet
/mnt/source/f2/*.parquet
/mnt/source/f3/*.parquet
/mnt/source/.../*.parquet
Originally, I tried loading all the parquet files in one go and transforming them:
var df = spark.Read().Parquet("/source/*/*.parquet");
With this approach, the operation would eventually time out without ever producing any output (we're talking about roughly 50 million source records).
I can successfully transform individual folders, though:
var df = spark.Read().Parquet("/source/f1/*.parquet");
This made me wonder if I should slice my data and iterate over the subfolders instead.
Using a simple foreach over the subfolders, the job finishes in 50 minutes!
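For reference, the loop is essentially the sketch below (the hard-coded folder list, the Transform() helper, and the /target output path are placeholders for my actual setup):

using Microsoft.Spark.Sql;

// spark is the SparkSession provided by the Synapse notebook.
// Placeholder folder list; in the real job it is discovered from storage.
var folders = new[] { "/source/f1", "/source/f2", "/source/f3" };

foreach (var folder in folders)
{
    var df = spark.Read().Parquet($"{folder}/*.parquet");

    // Transform() stands in for my actual transformation logic.
    var transformed = Transform(df);

    // Append each folder's result to the same output location.
    transformed.Write().Mode(SaveMode.Append).Parquet("/target");
}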
What is the correct approach to solve this? Should I use a foreach, a Parallel.ForEach(), or something entirely different (Spark-specific?) to transform all the input folders?
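To illustrate the Parallel.ForEach() variant I have in mind, a sketch (same placeholders as above; I don't know whether submitting jobs concurrently through the single notebook SparkSession is safe or beneficial, which is part of what I'm asking):

using System.Threading.Tasks;

// Each iteration submits its own Spark job through the shared session.
Parallel.ForEach(folders, folder =>
{
    var df = spark.Read().Parquet($"{folder}/*.parquet");
    Transform(df).Write().Mode(SaveMode.Append).Parquet("/target");
});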