
I am setting up a data pipeline that produces a huge number of small files in Delta tables partitioned by certain dimensions. To improve read performance for consumers, I want to add compaction. Delta's OPTIMIZE compaction works very well when data is added to a partition only once. However, my data is incrementally appended to the same partition, and running OPTIMIZE on that partition twice re-compacts the previously compacted data, which makes things harder for consumers because they receive the same data again.

We also have a backfill scenario, where data can be backfilled into older partitions. When we then run compaction on those old partitions, the already compacted data for the entire partition is re-compacted together with the newly added data, leading to a lot of duplicates downstream.

Is there a way to specify predicates so that only the newly added data within a partition is compacted, disregarding the files that were already compacted? I do not want to run OPTIMIZE on data that has already been compacted.
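For context, the ingestion pattern is roughly the following; the table path, source path, and partition column are placeholders, not the real names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("ingest").getOrCreate();

// Each micro-batch appends a small slice of rows into an already existing
// partition, so the partition accumulates many small files over time.
Dataset<Row> batch = spark.read().parquet("/landing/batch_00042");  // placeholder source
batch.write()
    .format("delta")
    .mode("append")              // incremental append into the same partition
    .save("/tables/events");     // table partitioned by e.g. event_date

This is the compaction I am currently running on the affected partitions: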

// Compact only the files in the partitions matching the predicate,
// e.g. partitionFilter = "event_date = '2024-01-01'"
DeltaTable deltaTable = DeltaTable.forPath(table);
Dataset<Row> compactionResult = deltaTable.optimize()
    .where(partitionFilter)
    .executeCompaction();

The data is getting compacted again and again, leading to duplicates for the consumers.
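For what it's worth, the result returned by executeCompaction() seems to confirm this on every run; the column names below are my reading of the OPTIMIZE output schema, so treat them as an assumption:

// Inspect the compaction metrics: previously compacted files keep showing up
// again under numFilesRemoved on each run (column names assumed from the
// OPTIMIZE output schema, not verified).
compactionResult
    .select("path", "metrics.numFilesAdded", "metrics.numFilesRemoved")
    .show(false);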

