I'm trying to read data from many different .csv files (all with the same structure), perform some operations with Spark, and finally save the result in Hudi format.
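The read step looks roughly like this (the input path is just a placeholder and the options are simplified):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-hudi").getOrCreate()

# All the .csv files share the same schema, so they can be read
# into a single DataFrame with a glob pattern.
raw_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/path/to/input/*.csv"))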
To store the data in the same Hudi table, I thought the best approach would be to use the append save mode when writing.
The issue is that doing this creates tons of small files, whose combined size surpasses the input dataset size by a long shot (10x in some cases).
This is my configuration for Hudi:
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'main_partition',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 10,
    'hoodie.insert.shuffle.parallelism': 10,
    'hoodie.delete.shuffle.parallelism': 10
}
The write operation is performed like this:
result_df.write.format("hudi").options(**hudi_options).mode("append").save(basePath)
Here result_df is a Spark DataFrame that always has the same schema but contains different data, and basePath is constant.
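For context, the overall flow I'm running is roughly this (the file list is a placeholder and the actual transformations are omitted; hudi_options and basePath are the ones shown above):

# One append write per input file, all targeting the same basePath.
csv_files = ["/path/to/input/file1.csv", "/path/to/input/file2.csv"]

for path in csv_files:
    raw_df = spark.read.option("header", "true").csv(path)
    result_df = raw_df  # ...the actual transformations go here...
    (result_df.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save(basePath))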
I checked the contents of the output files and they have the correct schema/data. So, is there a way to append new data to the same Hudi table without it being split into so many small files?
I'm fairly new to Apache Spark and Hudi, so any help/suggestions would be much appreciated ;-)