
I'm performance optimizing my pipeline, and when I open the Job Tracker for my Transform job, I notice there are several stages at the beginning of the job for something called ExecuteStats.scala. Is there any way to optimize my job by removing or skipping these? They typically take tens of seconds, and they occur every time I run my transformation.

vanhooser

1 Answer


This stage type is executed when your files don't yet have statistics computed on them, i.e. if you have ingested non-parquet files (or, more generally, files that do not have summary statistics embedded in them).

Let's imagine you uploaded a .csv file via Data Connection or manually in the Foundry UI. When you do this, you apply a schema, and Spark is able to read the file and run computations on top of it. However, Spark needs to understand the distribution of values in the file contents in order to estimate join strategies, apply AQE (Adaptive Query Execution) optimizations, and make other related decisions. Therefore, before any computation can run, each .csv file has a stage executed on it to compute these statistics.
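As a rough illustration of how those estimates matter (the file paths and join column here are hypothetical, not from your pipeline), you can inspect which join strategy Spark's optimizer picks based on its size/statistics estimates:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths and columns, for illustration only.
small = spark.read.option("header", "true").csv("/tmp/small_lookup.csv")
large = spark.read.parquet("/tmp/large_facts.parquet")

# The optimizer uses size/statistics estimates to decide, for example, whether
# `small` falls under spark.sql.autoBroadcastJoinThreshold and can be broadcast
# rather than shuffled for a sort-merge join.
joined = large.join(small, on="id", how="left")
joined.explain()  # inspect which join strategy ended up in the physical plan
```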

This means that every time you run a downstream transformation on these non-parquet files, the statistics are re-computed. Spark's tendency to re-run stages in larger jobs only magnifies the problem.

Instead, you can inject a step immediately after the .csv dataset that performs a select * with repartition(1) and writes out a single parquet file (assuming one file is appropriate for your .csv size); Foundry will then compute statistics on the contents only once. Point your downstream transformations at this new dataset instead of the .csv, and you'll see the ExecuteStats.scala stages no longer run. A sketch of such an intermediate transform follows below.
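A minimal sketch using Foundry's Python transforms API (the dataset paths are hypothetical placeholders; substitute your own):

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/my_data_parquet"),       # parquet copy; stats computed once
    source_df=Input("/Project/datasets/my_data_csv"),  # CSV-backed dataset from Data Connection
)
def my_data_parquet(source_df):
    # select * is implicit; repartition(1) produces a single output file
    # (adjust the partition count if your .csv is large).
    return source_df.repartition(1)
```

Downstream transforms should then take the parquet dataset as their input, and the ExecuteStats.scala stages should disappear from the Job Tracker.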

vanhooser