I am appending rows to a delta table once per day:
(df
.write
.mode("append")
.format("delta")
.save("/mytable")
)
On each append, a new data file is created. The problem is that each file is only around 1 MB, so as this table grows I will end up with thousands of tiny files.
Databricks recommends that each partition hold at least 1 GB of data.
Each row has a different TimeStamp value. How can I confirm that Ingestion Time Clustering has identified the TimeStamp column and is working?
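To frame the question: my understanding is that the Delta transaction log under `/mytable/_delta_log/` records per-file min/max column stats, and that effective clustering on TimeStamp would show up as narrow, non-overlapping ranges per file. Here is a minimal sketch of the kind of "add" entry I mean (the file path and timestamp values are invented, not from my table):

```python
import json

# Hypothetical "add" action, shaped like one line of a Delta commit file
# (_delta_log/00000000000000000042.json). Real entries can be read with
# open()/json.loads, one JSON object per line.
add_action = {
    "add": {
        "path": "part-00000.snappy.parquet",
        "stats": json.dumps({  # stats are stored as a JSON string
            "numRecords": 1000,
            "minValues": {"TimeStamp": "2024-01-15T00:00:01"},
            "maxValues": {"TimeStamp": "2024-01-15T23:59:58"},
        }),
    }
}

# If ingestion time clustering is working, each file's TimeStamp range
# should be tight like this one, and files should barely overlap.
stats = json.loads(add_action["add"]["stats"])
print(stats["minValues"]["TimeStamp"], "to", stats["maxValues"]["TimeStamp"])
```

Is inspecting these stats the right way to confirm it, or is there a more direct signal?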
Maybe it would make sense to create a "Month" or "Year" column and partition by that, so each partition is larger? Would Ingestion Time Clustering on the TimeStamp column still be relevant then? I know Ingestion Time Clustering is a good tool for high-cardinality columns, but is it useful in this case?