I'm using .NET Spark in a Synapse notebook to transform data. The source data consists of multiple parquet files within subfolders in a directory.

/mnt/source/f1/*.parquet
/mnt/source/f2/*.parquet
/mnt/source/f3/*.parquet
/mnt/source/.../*.parquet

Originally, I tried loading all parquet files in one go and transforming them:

var df = spark.Read().Parquet("/source/*/*.parquet");

With this approach, the operation would eventually time out without ever producing any output (we're talking about roughly 50 million source records).

I can successfully transform individual folders though:

var df = spark.Read().Parquet("/source/f1/*.parquet");

This made me wonder if I should slice my data and iterate over the subfolders instead.

Using a simple foreach, the job finishes in 50 minutes!
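
For clarity, the loop I'm using looks roughly like this. It is only a sketch: the hard-coded folder list and the Transform() helper stand in for my actual enumeration and join logic, and /target/output is a placeholder destination.

using Microsoft.Spark.Sql;

// `spark` is the SparkSession that the Synapse notebook predefines.
var folders = new[] { "/source/f1", "/source/f2", "/source/f3" };

foreach (var folder in folders)
{
    // Read one subfolder at a time and transform it in isolation.
    DataFrame df = spark.Read().Parquet($"{folder}/*.parquet");

    // Placeholder for the actual join-heavy transformation.
    DataFrame transformed = Transform(df);

    // Append each slice to the same output location.
    transformed.Write().Mode("append").Parquet("/target/output");
}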

What is the correct approach to solve this? Should I use a foreach, a Parallel.ForEach() or something entirely different (Spark-specific?) to transform all input folders?
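
For reference, the Parallel.ForEach() variant I have in mind would look roughly like this (same placeholders as above). Each driver thread submits its own Spark job, and each slice gets its own output folder so no two writers touch the same path:

using System.Threading.Tasks;
using Microsoft.Spark.Sql;

var folders = new[] { "/source/f1", "/source/f2", "/source/f3" };

Parallel.ForEach(folders, folder =>
{
    // Each iteration reads and transforms one subfolder; Spark schedules
    // the resulting jobs concurrently on the pool.
    DataFrame df = spark.Read().Parquet($"{folder}/*.parquet");

    // Write to a per-folder destination to avoid concurrent writers
    // racing on a single output path.
    var name = System.IO.Path.GetFileName(folder); // e.g. "f1"
    Transform(df).Write().Mode("overwrite").Parquet($"/target/{name}");
});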

Krumelur
  • Try using a single `*` for the folder. – Pratik Lad Jan 16 '23 at 10:10
  • @PratikLad - I just edited my post. I noticed I wasn't clear about what the issue is: using the wildcard path would time out. – Krumelur Jan 16 '23 at 10:50
  • If you just do a wildcard read and a count (no transformations) is it still timing out? – ScootCork Jan 16 '23 at 11:27
  • @ScootCork Getting counts returns after about 10 minutes with roughly 1 million rows. Note that each row has nested array data, contributing to the total of 50 million. The transformation is running joins on the nested data. (see my edits too) – Krumelur Jan 16 '23 at 11:43
  • It's hard to say why it is timing out; it might be that your Spark pool is not properly sized for the transformations you're doing, or you might have a large number of small parquet files causing long read times. You'd have to dive into what is happening (query explain plan, Spark pool logs, etc.). – ScootCork Jan 17 '23 at 10:05

0 Answers