0

I'm new to Apache Hudi.

In Apache Hudi, the merge-on-read table type merges delta data asynchronously. The data is merged when it is queried or when the merge condition (an interval or an unmerged-commit count) is met.

But Hudi has no background process of its own, unlike Hive. How can Hudi merge the data?

Thanks.

SHRIN

1 Answer

0

You can use the following Hudi options when you write:

hudi_options = {
  "hoodie.table.name": "<YOUR TABLE NAME>",
  "hoodie.datasource.write.table.type": "MERGE_ON_READ",
  "hoodie.datasource.write.recordkey.field": "<YOUR UNIQUE KEY>",
  "hoodie.datasource.write.precombine.field": "<YOUR PRECOMBINE FIELD>",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.index.type": "BLOOM",
  "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
  "hoodie.compact.inline": "true",
  "hoodie.compact.inline.max.delta.commits": 10,  # choose a value based on your need
}
inputDF.write.format("org.apache.hudi") \
  .option("hoodie.datasource.write.operation", "upsert") \
  .options(**hudi_options) \
  .mode("append") \
  .save("<YOUR OUTPUT PATH>")

`hoodie.compact.inline` and `hoodie.compact.inline.max.delta.commits` are what tell Hudi to perform the merge: once the number of delta commits reaches the value set in `hoodie.compact.inline.max.delta.commits`, Hudi runs a compaction inline, as part of the write job itself, and merges the delta logs into the base files.
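To see how the merge shows up at query time, here is a minimal sketch (assuming a SparkSession named `spark` and the same output path as above; `hoodie.datasource.query.type` is the standard Hudi DataSource read option): a snapshot query merges the delta logs with the base files on the fly, while a read-optimized query reads only the already-compacted base files.

# Snapshot query: merges base files (Parquet) with delta logs (Avro) at read time
snapshot_df = spark.read.format("hudi") \
  .option("hoodie.datasource.query.type", "snapshot") \
  .load("<YOUR OUTPUT PATH>")

# Read-optimized query: reads only the compacted base files (faster, but may
# lag behind the latest writes until the next compaction runs)
ro_df = spark.read.format("hudi") \
  .option("hoodie.datasource.query.type", "read_optimized") \
  .load("<YOUR OUTPUT PATH>")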

Please let me know if it helps.

Felix K Jose
  • But what process does this job after the Hudi write is done? – SHRIN Jul 12 '21 at 08:05
  • @SHRIN Hudi automatically handles it for you after every `max.delta.commits`. That means after every 10 commits, Hudi automatically runs a compaction that merges the delta logs (Avro) with the base files (Parquet) and generates new columnar (Parquet) files. So you don't need to do anything except have this configuration in your job – Felix K Jose Jul 12 '21 at 12:17
  • @SHRIN Please let me know if it works as well as answers your question. – Felix K Jose Jul 12 '21 at 18:51