I am trying to use clustering in a Hudi COW table so that each partition folder keeps only a single file when the total partition data size is less than 128 MB. However, clustering does not seem to work with bulk_insert as expected. We have a few tables in the TB range (20 TB, 7 TB, 3 TB) with about 77,000 partitions. Below are the options we tried. We run our PySpark job on EMR Serverless 6.8.0.
Hudi write operation "bulk_insert" with the clustering configs below:

hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
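For reference, this is roughly how the options are passed in our PySpark job. The table name, record key, partition field, and S3 path below are placeholders, not our real values:

```python
# Rough sketch of the Hudi write options for the bulk_insert attempt.
# Table name, record key, partition field, and S3 path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",                      # placeholder
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.recordkey.field": "id",      # placeholder
    "hoodie.datasource.write.partitionpath.field": "dt",  # placeholder
    # Clustering configs from the attempt above:
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "1073741824",  # 1 GB
    "hoodie.clustering.plan.strategy.small.file.limit": "134217728",        # 128 MB
    "hoodie.clustering.plan.strategy.sort.columns": "columnA,ColumnB",
}

# The actual write (df is the DataFrame being loaded):
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
```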
Result: the output partition has 26 files of around 800 KB each.

Hudi write operation "bulk_insert" with all clustering configurations removed.
Result: the output partition has 26 files of around 800 KB each.

Hudi write operation "insert" with the clustering configs below:
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=4
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=134217728
hoodie.clustering.plan.strategy.sort.columns=columnA,ColumnB
Result: the output partition has only 1 file, of size 11 MB.

Hudi write operation "insert" with all clustering configurations removed.
Result: the output partition has only 1 file, of size 11 MB.

We also tried the Hudi configurations below, but got the same results as above:
hoodie.parquet.max.file.size=125829120
hoodie.parquet.small.file.limit=104857600
hoodie.clustering.plan.strategy.small.file.limit=600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=4
hoodie.clustering.plan.strategy.max.bytes.per.group=1073741824
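For completeness, the last set of configs expressed as a Python dict, exactly as we set them (they get merged into the full set of write options before the write call):

```python
# Additional options from the last attempt, kept as strings exactly as we set them.
extra_options = {
    "hoodie.parquet.max.file.size": "125829120",       # 120 MB
    "hoodie.parquet.small.file.limit": "104857600",    # 100 MB
    "hoodie.clustering.plan.strategy.small.file.limit": "600",
    "hoodie.clustering.async.enabled": "true",
    "hoodie.clustering.async.max.commits": "4",
    "hoodie.clustering.plan.strategy.max.bytes.per.group": "1073741824",  # 1 GB
}
# These are merged into the write options dict before df.write, e.g.:
# df.write.format("hudi").options(**{**write_options, **extra_options}).mode("append").save(path)
```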
It seems clustering is not applied in bulk_insert mode but is applied in insert mode. Can anybody tell me whether this is the right approach, or am I doing something wrong here? Your help is highly appreciated.