I have a Databricks Delta table that is around 400 GB and is non-partitioned (Databricks recommends not partitioning tables smaller than 1 TB). This table is the target of a streaming pipeline. I have Z-ordered the table on occurence_dttm (the data type is string, but it holds a timestamp; sample value: 2022-03-15T22:41:30.011Z). The Z-order command I use is:
'''
OPTIMIZE tablename
ZORDER BY (occurence_dttm)
'''
I run this OPTIMIZE operation once daily. occurence_dttm is a column we get directly from the source; it can contain timestamps for today as well as yesterday, but nothing earlier. The problem is that most of the time this OPTIMIZE with ZORDER ends up doing a full optimize and takes 4 hours to complete; only once did it perform an incremental optimize, which finished in 7 minutes.
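For context, this is the kind of time-range query the Z-order is meant to speed up (a hypothetical example on my side; the actual queries vary). String comparison works here because ISO-8601 UTC strings sort lexicographically in the same order as the timestamps they encode, so file-level min/max skipping on the string column still applies:
'''
-- Hypothetical reader query (sketch): a time-range filter on the
-- Z-ordered string column. Because the ISO-8601 strings are fixed-width
-- and sort chronologically, Delta's file-level min/max statistics can
-- skip files entirely outside the range.
SELECT *
FROM tablename
WHERE occurence_dttm >= '2022-03-15T00:00:00.000Z'
  AND occurence_dttm <  '2022-03-16T00:00:00.000Z'
'''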
{"numFilesAdded": 1, "numFilesRemoved": 100, "filesAdded": {"min": 55277570, "max": 55277570, "avg": 55277570, "totalFiles": 1, "totalSize": 55277570}, "filesRemoved": {"min": 428703, "max": 4891829, "avg": 957992.17, "totalFiles": 100, "totalSize": 95799217}, "partitionsOptimized": 0, "zOrderStats": {"strategyName": "minCubeSize(107374182400)", "inputCubeFiles": {"num": 514, "size": 108596707661}, "inputOtherFiles": {"num": 100, "size": 95799217}, "inputNumCubes": 1, "mergedFiles": {"num": 100, "size": 95799217}, "numOutputCubes": 1, "mergedNumCubes": null}, "numBatches": 1, "totalConsideredFiles": 614, "totalFilesSkipped": 514, "preserveInsertionOrder": false}
'''
Below is the output when it does a full optimize:
'''
{"numFilesAdded": 396, "numFilesRemoved": 739, "filesAdded": {"min": 57810288, "max": 442186574, "avg": 268744949.3055556, "totalFiles": 396, "totalSize": 106422999925}, "filesRemoved": {"min": 398923, "max": 430759983, "avg": 143900314.78213802, "totalFiles": 739, "totalSize": 106342332624}, "partitionsOptimized": 0, "zOrderStats": {"strategyName": "minCubeSize(107374182400)", "inputCubeFiles": {"num": 0, "size": 0}, "inputOtherFiles": {"num": 739, "size": 106342332624}, "inputNumCubes": 0, "mergedFiles": {"num": 739, "size": 106342332624}, "numOutputCubes": 1, "mergedNumCubes": null}, "numBatches": 1, "totalConsideredFiles": 739, "totalFilesSkipped": 0, "preserveInsertionOrder": false}
'''
I am trying to understand: under what conditions does Databricks decide to do a full versus an incremental Z-order operation? The key difference I can see above is that in the incremental run totalFilesSkipped is 514 and inputCubeFiles is non-zero, while in the full run both are 0.
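To compare the daily runs over time, something like the following can pull the OPTIMIZE entries from the table history (a minimal sketch; DESCRIBE HISTORY, operation, operationParameters, and the operationMetrics string map are standard Delta, though the exact metric keys available can vary by runtime version):
'''
-- Sketch: list recent OPTIMIZE runs with their file counts, to see which
-- days rewrote only a few files (incremental) versus most of the table
-- (full). operationMetrics is a map<string, string>.
SELECT timestamp,
       operationParameters,                                 -- includes zOrderBy
       operationMetrics['numAddedFiles']   AS num_added_files,
       operationMetrics['numRemovedFiles'] AS num_removed_files
FROM (DESCRIBE HISTORY tablename)
WHERE operation = 'OPTIMIZE'
ORDER BY timestamp DESC
'''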