
I have a Hudi table that I write as MOR (Merge on Read). Here's the config:

conf = {
    'className': 'org.apache.hudi',
    'hoodie.table.name': hudi_table_name,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.datasource.write.precombine.field': 'timestamp',
    'hoodie.datasource.write.recordkey.field': 'user_id',
    #'hoodie.datasource.write.partitionpath.field': 'year:SIMPLE,month:SIMPLE,day:SIMPLE',
    #'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    #'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    #'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-mm-dd',
    #'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd'
}

I noticed that for each new record I append, I get a Parquet file, and when I update any of the records I get a log file containing the update. After a number of appends, the Parquet files get compacted into a single Parquet file. However, this compacted file contains the old values of the initially added records, not the updated ones. Any clue what I might be doing wrong?
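
For reference, this is roughly how I read the table back to compare the two MOR views (a sketch: the S3 path and the user_id filter are placeholders; the query-type option values are as I understand them from the Hudi docs):

spark = glueContext.spark_session
hudi_path = "s3://my-bucket/path/to/hudi_table/"  # placeholder

# Snapshot view: base parquet files merged on the fly with the log files
snapshot_df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load(hudi_path)
)

# Read-optimized view: only the compacted base parquet files
ro_df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load(hudi_path)
)

# Compare one record across both views
snapshot_df.filter("user_id = 'some_id'").show()
ro_df.filter("user_id = 'some_id'").show()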

I am writing it as:

glueContext.forEachBatch(
    frame=data_frame_DataSource0,
    batch_function=processBatch,
    options={
        "windowSize": window_size,
        "checkpointLocation": s3_path_spark
    }
)

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "df"),
    connection_type="custom.spark",
    connection_options=conf
)