We have a dataset that is built incrementally on our Foundry instance. It is a large time series dataset (56.5 billion rows, 10 columns, 965 GB) with timestamps in 1-hour buckets, and it grows by around 10 GB per day.
To optimise the dataset for analysis, we repartition it on two attributes, "measure_date" and "measuring_time".
This reflects the access pattern: the dataset is usually filtered by "measure_date". We sub-partition by "measuring_time" to decrease the size of the Parquet files produced, and because filtering on time is a common access pattern as well.
The code that creates the partitions is as follows:
if ctx.is_incremental:
    return df.repartition(24, "measure_date", "measuring_time")
else:
    return df.repartition(2200, "measure_date", "measuring_time")
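For context, this snippet lives inside an incremental transform along these lines (a sketch; the dataset paths are placeholders, not our real ones):

from transforms.api import transform_df, incremental, Input, Output

@incremental()
@transform_df(
    Output("/Project/telemetry/telemetry_data"),
    df=Input("/Project/telemetry/telemetry_data_raw"),
)
def compute(ctx, df):
    if ctx.is_incremental:
        # Incremental appends are small (~10 GB/day), so 24 files suffice.
        return df.repartition(24, "measure_date", "measuring_time")
    else:
        # A full snapshot rebuild spreads ~965 GB over many more files.
        return df.repartition(2200, "measure_date", "measuring_time")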
Using hash partitioning creates unbalanced file sizes, but that is the topic of a different post.
I am now trying to find out how to make Spark on Foundry utilize these partitions when evaluating filter criteria. From what I can see, it currently does NOT.
I created a code workbook and ran the following query on the telemetry data, saving the result to another dataset.
SELECT *
FROM telemetry_data
WHERE measure_date = '2022-06-05'
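For reference, the same predicate expressed in PySpark, assuming the usual code workbook setup where the input arrives as a DataFrame (the function name is my own):

from pyspark.sql import functions as F

def telemetry_filtered(telemetry_data):
    # Same filter as the SQL above: a single equality predicate on the
    # column the dataset is repartitioned by.
    return telemetry_data.filter(F.col("measure_date") == "2022-06-05")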
The physical query plan of the build seems to indicate that Spark is not using any partitions: PartitionFilters is empty in the plan.
Batched: true, BucketedScan: false, DataFilters: [isnotnull(measure_date#170), (measure_date#170 = 19148)],
Format: Parquet, Location: InMemoryFileIndex[sparkfoundry://prodapp06.palantir:8101/datasets/ri.foundry.main.dataset.xxx...,
PartitionFilters: [],
PushedFilters: [IsNotNull(measure_date), EqualTo(measure_date,2022-06-05)],
ReadSchema: struct<xxx,measure_date:date,measuring_time_cet:timestamp,fxxx, ScanMode: RegularMode
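For completeness, the plan can also be inspected directly from a workbook, assuming Spark 3.x (a sketch):

from pyspark.sql import functions as F

filtered = telemetry_data.filter(F.col("measure_date") == "2022-06-05")
# Look for the PartitionFilters entry in the FileScan node of the output.
filtered.explain(mode="formatted")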
How can I make Spark on Foundry use partition pruning?