I am trying to use clustering in a Hudi COW table so that each partition folder keeps only a single file when the total partition data size is less than 128 MB. However, clustering does not seem to work with bulk_insert as expected. We have a few tables in the TB range (20 TB, 7 TB, 3 TB) with about 77,000 partitions. Below are the options we tried. We run our PySpark job on EMR Serverless 6.8.0.
Hudi write operation "bulk_insert" with the clustering configs below:

hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
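For reference, this is roughly how the options are passed in our PySpark job. The table name, record key, partition field, and S3 path below are placeholders, not our real values:

```python
# Rough sketch of the Hudi write options for the bulk_insert attempt.
# Table name, record key, partition field, and S3 path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",                      # placeholder
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "id",      # placeholder
    "hoodie.datasource.write.partitionpath.field": "dt",  # placeholder
    # Clustering configs from the attempt above:
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",  # 1 GB
    "hoodie.clustering.plan.strategy.small.file.limit": "134217728",        # 128 MB
    "hoodie.clustering.plan.strategy.sort.columns": "columnA,ColumnB",
}

# The actual write (df is the DataFrame being loaded):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
```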
Result: the output partition has 26 files of around 800 KB each.

Hudi write operation "bulk_insert" with all clustering configurations removed.
Result: the output partition has 26 files of around 800 KB each.

Hudi write operation "insert" with the clustering configs below:
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
Result: the output partition has only 1 file, of size 11 MB.

Hudi write operation "insert" with all clustering configurations removed.
Result: the output partition has only 1 file, of size 11 MB.

We also tried the Hudi configurations below, but got the same results as above:
hoodie.parquet.max.file.size=125829120
hoodie.parquet.small.file.limit=104857600
hoodie.clustering.plan.strategy.small.file.limit=600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.max.bytes.per.group=1073741824
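For completeness, the last set of configs expressed as a Python dict, exactly as we set them (they get merged into the full set of write options before the write call):

```python
# Additional options from the last attempt, kept as strings exactly as we set them.
extra_options = {
    "hoodie.parquet.max.file.size": "125829120",       # 120 MB
    "hoodie.parquet.small.file.limit": "104857600",    # 100 MB
    "hoodie.clustering.plan.strategy.small.file.limit": "600",
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.async.max.commits": "4",
    "hoodie.clustering.plan.strategy.max.bytes.per.group": "1073741824",  # 1 GB
}
# These are merged into the write options dict before df.write, e.g.:
# df.write.format("hudi").options(**{**write_options, **extra_options}).mode("append").save(path)
```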
It seems clustering is not applied in bulk_insert mode but is applied in insert mode. Can anybody tell me whether this is the right approach, or am I doing something wrong here? Your help is highly appreciated.