This is a very bad partitioning scheme. You simply have too many unique values for column A, and the additional partitioning creates even more partitions. Spark will need to create at least 90k partitions, each requiring its own (small) files, and small files hurt performance.
For non-Delta tables, partitioning is primarily used to perform data skipping when reading data. For Delta Lake tables, though, partitioning is less important, because Delta on Databricks already includes data skipping, lets you apply ZORDER, etc.
I would recommend a different partitioning scheme, for example year + month only, and running OPTIMIZE with ZORDER on the A column after the data is written. This will produce only a few partitions with bigger files.
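For illustration, here is a minimal PySpark sketch of that approach. The table names, the source date column, and how year/month are derived are assumptions here, so adapt them to your actual schema:

```python
# Minimal sketch (PySpark on Databricks, where `spark` is the predefined session).
# "source_table", "target_table", and "event_date" are hypothetical names.
from pyspark.sql import functions as F

df = spark.table("source_table")

(df
 .withColumn("year", F.year("event_date"))    # assumes a date/timestamp column exists
 .withColumn("month", F.month("event_date"))
 .write
 .format("delta")
 .partitionBy("year", "month")                # coarse partitioning only
 .mode("append")
 .saveAsTable("target_table"))

# After the write, compact small files and co-locate rows by A,
# so data skipping can prune files on A at read time.
spark.sql("OPTIMIZE target_table ZORDER BY (A)")
```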