
I have a MOR table with key = acctid. When I make 3 commits on the same key and then read in incremental mode, I see only the 1st commit. Is there any way to read the last commit, or all commits, for a given key using incremental mode?

Please check the details below.

In the first run I insert the following data into the MOR table:

input_df = spark.createDataFrame(
    [
        (100, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (101, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (102, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (103, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (104, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (105, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
    ],
    ("acctid", "date", "ts", "deposit"),
)

hudi options are:

hudi_options = {
    'hoodie.table.name': 'compaction',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'acctid',
    'hoodie.datasource.write.partitionpath.field': 'date',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.delete.shuffle.parallelism': 2,
}
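
The precombine field set above (ts) is what Hudi uses to pick a winner when several rows share the same record key. A pure-Python sketch of that rule (the helper name is mine, not Hudi API):

```python
# Pure-Python sketch (not Hudi code) of the precombine rule: when several rows
# share a record key, the row with the largest precombine field ("ts") wins.
def precombine_latest(rows, key_field="acctid", precombine_field="ts"):
    winners = {}
    for row in rows:
        key = row[key_field]
        if key not in winners or row[precombine_field] > winners[key][precombine_field]:
            winners[key] = row
    return list(winners.values())

rows = [
    {"acctid": 100, "ts": "2015-01-01T11:01:01.000000Z", "deposit": 11},
    {"acctid": 100, "ts": "2015-01-01T13:01:01.000000Z", "deposit": 13},
]
print(precombine_latest(rows))  # only the ts=13 row survives for key 100
```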

After that I update key 100 three times, each time with a different ts and deposit, so that 3 commits are made on the same key:

# UPDATE deposit to **11** for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T11:01:01.000000Z", 11)],("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
# UPDATE deposit to **12** for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T12:01:01.000000Z", 12)],("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
# UPDATE deposit to **13** for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T13:01:01.000000Z", 13)],("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
first_commit = '20210719234312' # As per this particular run
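
I copied the instant string above from the run by hand; it can also be read off the table's timeline, where (in the Hudi versions I'm using) each completed MOR delta commit leaves an `<instant>.deltacommit` file under `.hoodie/`. A sketch against a fabricated directory, since the real path depends on the run:

```python
import os
import tempfile

def list_delta_commits(table_path):
    # Completed MOR delta commits appear as "<instant>.deltacommit" under .hoodie/
    timeline = os.path.join(table_path, ".hoodie")
    return sorted(f.split(".")[0] for f in os.listdir(timeline)
                  if f.endswith(".deltacommit"))

# Demo against a fabricated timeline; the instant values are made up.
table = tempfile.mkdtemp()
os.makedirs(os.path.join(table, ".hoodie"))
for instant in ("20210719234312", "20210719234400", "20210719234500"):
    open(os.path.join(table, ".hoodie", instant + ".deltacommit"), "w").close()
print(list_delta_commits(table)[0])  # earliest instant, usable as the begin instant
```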

output_df = (spark.read
             .option("hoodie.datasource.query.type", "incremental")
             .option("hoodie.datasource.read.begin.instanttime", first_commit)
             .format("org.apache.hudi")
             .load(hudi_dataset+"/*/*"))

output_df.show()

In this output I see deposit = 11. Is there any way to get deposit = 13, or all three versions, in incremental mode without using compaction?
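
For reference, I'm aware the incremental read also takes `hoodie.datasource.read.end.instanttime`, so each commit could in principle be pulled as its own window; a sketch of the options (the instant values are placeholders from this run, and the Spark call is left commented since it needs the session and path above):

```python
# Sketch only: pull a single commit as its own incremental window.
# The instant strings are placeholders; substitute real ones from the timeline.
incr_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20210719234312",
    "hoodie.datasource.read.end.instanttime": "20210719234400",
}
# output_df = (spark.read.format("org.apache.hudi")
#              .options(**incr_options)
#              .load(hudi_dataset + "/*/*"))
```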

sannidhi