I have a MOR table with record key = acctid. When I make 3 commits on the same key and then read in incremental mode, I see only the first commit. Is there any way to read the last commit, or all commits, for a given key using incremental mode?
Please check details below:
I have the following data inserted into the MOR table in the first run:
input_df = spark.createDataFrame(
    [
        (100, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (101, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (102, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (103, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (104, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
        (105, "2015-01-01", "2015-01-01T01:01:01.010101Z", 10),
    ],
    ("acctid", "date", "ts", "deposit"),
)
The Hudi options are:
hudi_options = {
    'hoodie.table.name': 'compaction',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'acctid',
    'hoodie.datasource.write.partitionpath.field': 'date',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.delete.shuffle.parallelism': 2,
}
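For reference, the first run writes input_df with the same pattern as the updates below; a minimal sketch, assuming hudi_dataset holds the table base path (the path shown here is hypothetical):
# Initial write (sketch): mode("overwrite") creates the table on the first run.
hudi_dataset = "/tmp/hudi/compaction"  # hypothetical base path, for illustration only
input_df.write.format("org.apache.hudi") \
    .options(**hudi_options) \
    .mode("overwrite") \
    .save(hudi_dataset)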
After that, I run three updates for key 100, each with a different value of ts and deposit, so that three commits land on the same key:
# UPDATE deposit to 11 for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T11:01:01.000000Z", 11)],
    ("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
# UPDATE deposit to 12 for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T12:01:01.000000Z", 12)],
    ("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
# UPDATE deposit to 13 for key 100
update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T13:01:01.000000Z", 13)],
    ("acctid", "date", "ts", "deposit"))
update_df.write.format("org.apache.hudi").options(**hudi_options).mode("append").save(hudi_dataset)
first_commit = '20210719234312' # As per this particular run
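Rather than hardcoding the instant, the first commit time can also be read back from Hudi's _hoodie_commit_time metadata column (a sketch under the same assumptions; the unchanged records 101-105 still carry the first commit time, so the minimum is the first instant):
# Sketch: derive the earliest commit instant from the metadata column
# instead of copying it from the timeline by hand.
first_commit = (spark.read
    .format("org.apache.hudi")
    .load(hudi_dataset + "/*/*")
    .select("_hoodie_commit_time")
    .distinct()
    .orderBy("_hoodie_commit_time")
    .first()[0])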
output_df = (spark.read
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", first_commit)
    .load(hudi_dataset + "/*/*"))
output_df.show()
In this output I see deposit = 11. Is there any way to get deposit = 13, or all three updates, in incremental mode without running compaction?
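For comparison, a snapshot query (the default query type) merges the base files with the log files and does return the latest value for the key; a minimal sketch:
# Snapshot read (default): merges base and log files, so key 100 shows
# the latest value, deposit = 13.
snapshot_df = (spark.read
    .format("org.apache.hudi")
    .load(hudi_dataset + "/*/*"))
snapshot_df.filter("acctid = 100").select("acctid", "ts", "deposit").show()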