
I have a Hudi table that I write as MOR (Merge on Read). Here's the config:

conf = {
    'className': 'org.apache.hudi',
    'hoodie.table.name': hudi_table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.recordkey.field': 'user_id',
    #'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE,month:SIMPLE,day:SIMPLE',
    #'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    #'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    #'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-mm-dd',
    #'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd'
}

I noticed that for each new record I append, I get a Parquet file, and when I update any of the records I get a log file containing the update. After a number of appends, the Parquet files get compacted into a single Parquet file. However, this compacted file contains the old values of the initially added records, not the updated ones. Any clue what I might be doing wrong?
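
For reference, this is roughly how I read the table back to compare the two MOR views (a sketch: the S3 path and the user_id filter are placeholders; the query-type option values are as I understand them from the Hudi docs):

spark = glueContext.spark_session
hudi_path = "s3://my-bucket/path/to/hudi_table/"  # placeholder

# Snapshot view: base parquet files merged on the fly with the log files
snapshot_df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(hudi_path)
)

# Read-optimized view: only the compacted base parquet files
ro_df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(hudi_path)
)

# Compare one record across both views
snapshot_df.filter("user_id = 'some_id'").show()
ro_df.filter("user_id = 'some_id'").show()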

I am writing it as:

glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "df"),
    connection_type="custom.spark",
    connection_options=conf
)