I am appending rows to a delta table once per day:
(df
.write
.mode("append")
.format("delta")
.save("/mytable")
)
On each append, a new data file is created. The problem is that each file is only around 1 MB, so as this table grows I will end up with thousands of tiny files.
Databricks recommends that each partition hold at least 1 GB of data.
Each row has a different TimeStamp value. How can I confirm that Ingestion Time Clustering has identified the TimeStamp column and is working?
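To frame the question: my understanding is that the Delta transaction log under `/mytable/_delta_log/` records per-file min/max column stats, and that effective clustering on TimeStamp would show up as narrow, non-overlapping ranges per file. Here is a minimal sketch of the kind of "add" entry I mean (the file path and timestamp values are invented, not from my table):

```python
import json

# Hypothetical "add" action, shaped like one line of a Delta commit file
# (_delta_log/00000000000000000042.json). Real entries can be read with
# open()/json.loads, one JSON object per line.
add_action = {
    "add": {
        "path": "part-00000.snappy.parquet",
        "stats": json.dumps({  # stats are stored as a JSON string
            "numRecords": 1000,
            "minValues": {"TimeStamp": "2024-01-15T00:00:01"},
            "maxValues": {"TimeStamp": "2024-01-15T23:59:58"},
        }),
    }
}

# If ingestion time clustering is working, each file's TimeStamp range
# should be tight like this one, and files should barely overlap.
stats = json.loads(add_action["add"]["stats"])
print(stats["minValues"]["TimeStamp"], "to", stats["maxValues"]["TimeStamp"])
```

Is inspecting these stats the right way to confirm it, or is there a more direct signal?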
Maybe it would make sense to create a "Month" or "Year" column and partition by that, so each partition is larger? Would Ingestion Time Clustering on the TimeStamp column still be relevant then? I know Ingestion Time Clustering is a good tool for high-cardinality columns, but is it useful in this case?