
I'm trying to read data from many different .csv files (all with the same structure), perform some operations with Spark, and finally save them in Hudi format.
To store the data in the same Hudi table, I thought the best approach would be to use the append mode when performing writes.
The issue is that doing this creates tons of small files, whose combined size surpasses the input dataset size by a long shot (10x in some cases).

This is my configuration for Hudi:

hudi_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'main_partition',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'upsert',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 10, 
  'hoodie.insert.shuffle.parallelism': 10,
  'hoodie.delete.shuffle.parallelism': 10
}

While the write op is performed like this:

result_df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)

Here result_df is a Spark DataFrame that always has the same schema but different data, and basePath is constant.
I checked the contents of the output files and they have the correct schema/data. So, is there a way to append data to the same Hudi table files without this blow-up?

I'm fairly new to apache Spark and Hudi, so any help/suggestions would be much appreciated ;-)

  • Are you still facing the issue? Apache Hudi works on the principle of MVCC (Multi-Version Concurrency Control), so every write creates a new version of the existing file in the following scenarios: 1. if the file size is less than the default max file size: 100 MB; 2. if you are updating existing records in an existing file. Add these two options to your hudi_options, which keeps only the latest two versions at any given time: "hoodie.cleaner.commits.retained": 1, "hoodie.keep.min.commits": 2. If you're still having the issue, please share your complete config details and I can help you. – Felix K Jose May 01 '21 at 02:57
  • Thanks! I had the same problem and your comment was really useful. Now only the latest two versions of the output files are being recorded. – Eduardo Freire Mangabeira Jun 28 '21 at 17:29

2 Answers


Please raise a GitHub issue (https://github.com/apache/hudi/issues) to get a timely response from the community.


Apache Hudi works on the principle of MVCC (Multi-Version Concurrency Control), so every write creates a new version of the existing file in the following scenarios:

1. if the file size is less than the default max file size: 100 MB
2. if you are updating existing records in an existing file

Add these two options to your hudi_options, which keeps only the latest two versions at any given time:

"hoodie.cleaner.commits.retained": 1, "hoodie.keep.min.commits": 2

From a comment by @felix-k-jose
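Putting the answer together with the question's original config, the resulting options dict might look like the sketch below (the table name and field names are taken from the question; the two retention settings are the only addition, and their exact values are a tuning choice, not a fixed requirement):

```python
tableName = "my_table"  # placeholder; use your actual table name

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'main_partition',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.delete.shuffle.parallelism': 10,
    # Cleaner: retain file slices from only the most recent commit,
    # so older versions of rewritten files are deleted.
    'hoodie.cleaner.commits.retained': 1,
    # Archival: keep at least two commits in the active timeline
    # (must be greater than hoodie.cleaner.commits.retained).
    'hoodie.keep.min.commits': 2,
}
```

The write call itself stays the same as in the question (`result_df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)`); only the options change.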
