3

We have a dataset that is built incrementally on our Foundry instance. It is a large time series dataset (56.5 billion rows, 10 columns, 965 GB) with timestamps in 1-hour buckets, and it grows by around 10 GB per day.

In order to optimise the dataset for analysis, we have repartitioned it on two attributes, "measure_date" and "measuring_time".

This reflects the access pattern: the dataset is usually accessed by "measure_date". We sub-partition by "measuring_time" to reduce the size of the Parquet files being produced, and because filtering on time is a common access pattern as well.

The code which creates the partitions is as follows:

if ctx.is_incremental:
    return df.repartition(24, "measure_date", "measuring_time")
else:
    return df.repartition(2200, "measure_date", "measuring_time")
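
For context, a minimal sketch of how this fragment might sit inside the incremental transform; the dataset paths and the function name are placeholders, not taken from the original post:

from transforms.api import transform_df, incremental, Input, Output

@incremental()
@transform_df(
    Output("/path/to/telemetry_data"),        # placeholder output path
    df=Input("/path/to/telemetry_source"),    # placeholder input path
)
def compute(ctx, df):
    # The dataframe is simply returned from transform_df; repartition only
    # controls how many files are written and how rows are hashed into them,
    # it does not create Hive-style partition directories on the output.
    if ctx.is_incremental:
        return df.repartition(24, "measure_date", "measuring_time")
    return df.repartition(2200, "measure_date", "measuring_time")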

Using hash partitioning creates unbalanced file sizes, but that is a topic for a different post.

I am now trying to find out how to make Spark on Foundry use the partitions when evaluating filter criteria. From what I can see, this is NOT happening.

I created a code workbook and ran the following query on the telemetry data, saving the result to another dataset.

SELECT *
FROM telemetry_data
WHERE measure_date = '2022-06-05'

The physical query plan of the build seems to indicate that Spark is not using any partitions, since PartitionFilters is empty in the plan.

Batched: true, BucketedScan: false, DataFilters: [isnotnull(measure_date#170), (measure_date#170 = 19148)], 
            Format: Parquet, Location: InMemoryFileIndex[sparkfoundry://prodapp06.palantir:8101/datasets/ri.foundry.main.dataset.xxx..., 
            PartitionFilters: [],
            PushedFilters: [IsNotNull(measure_date), EqualTo(measure_date,2022-06-05)],
            ReadSchema: struct<xxx,measure_date:date,measuring_time_cet:timestamp,fxxx, ScanMode: RegularMode
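
For reference, one way to reproduce such a plan directly in PySpark (illustrative only; telemetry_data stands in for the workbook's input dataframe):

# Print the extended plan for the date filter and check whether
# PartitionFilters is populated, i.e. whether partition pruning applies.
filtered = telemetry_data.filter(telemetry_data["measure_date"] == "2022-06-05")
filtered.explain(True)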

How can I make Spark on Foundry use partition pruning?

twinkle2
  • Spark 3 supports dynamic partition pruning, which points at "it should be supported": https://dzone.com/articles/dynamic-partition-pruning-in-spark-30 – fmsf Jun 07 '22 at 11:11

1 Answer

3

I believe you need to use transforms.api.IncrementalTransformOutput.write_dataframe() with partitionBy=['measure_date', 'measuring_time'] to achieve what you are looking for.

Check the Foundry docs for more details.
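
As a rough sketch of what that could look like (paths and function name are placeholders; the answer names the argument partitionBy, while the transforms API documents a partition_cols keyword on write_dataframe, so verify the exact name against the Foundry docs for your environment):

from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/path/to/telemetry_data_partitioned"),  # placeholder path
    source=Input("/path/to/telemetry_data"),            # placeholder path
)
def compute(out, source):
    df = source.dataframe()
    # Write Hive-style partitions instead of returning the dataframe.
    # The keyword is given as partitionBy in the answer and as partition_cols
    # in the transforms API docs -- check which one your version expects.
    out.write_dataframe(
        df,
        partition_cols=["measure_date", "measuring_time"],
    )

Written this way, the output files land under measure_date=.../measuring_time=... directories, which is what lets the PartitionFilters entry in the physical plan be populated.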

  • @twinkle2 can you try it with write_dataframe instead of the return and see if it helps? I was discussing with some team members and this seems like the correct way. – fmsf Jun 07 '22 at 21:33
  • yes, using write_dataframe works perfectly: a first indication is that the file paths/names of the dataset now contain the names of the partition attributes when I use write_dataframe, which was not the case previously (return in transform_df). If I query the dataset, I can see in the physical plan that PartitionFilters are used. – twinkle2 Jun 08 '22 at 19:16