Hudi data overrides every time on new batch of spark structure streaming

Question

I am working on spark structure streaming where job consuming Kafka message, do aggregation and save data in apache hudi table every 10 seconds. The below code is working fine but it overwrites the resultant apache hudi table data on every batch. I do not yet figure out why it is happening? Is it spark structure streaming or hudi behavior? I am using MERGE_ON_READ so the table file should not delete on every update. But don't know why it is happening? Due to this issue, my other job failed which read this table.

    spark.readStream
                .format('kafka')
                .option("kafka.bootstrap.servers",
                        "localhost:9092")
      ...
      ...                  
    df1 = df.groupby('a', 'b', 'c').agg(sum('d').alias('d'))
    df1.writeStream
              .format('org.apache.hudi')
              .option('hoodie.table.name', 'table1')
              .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
              .option('hoodie.datasource.write.keygenerator.class', 'org.apache.hudi.keygen.ComplexKeyGenerator')
              .option('hoodie.datasource.write.recordkey.field', "a,b,c")
              .option('hoodie.datasource.write.partitionpath.field', 'a')
              .option('hoodie.datasource.write.table.name', 'table1')
              .option('hoodie.datasource.write.operation', 'upsert')
              .option('hoodie.datasource.write.precombine.field', 'c')
              .outputMode('complete')
              .option('path', '/Users/lucy/hudi/table1')
              .option("checkpointLocation",
                      "/Users/lucy/checkpoint/table1")
              .trigger(processingTime="10 second")
              .start()
              .awaitTermination()

score 0 · Answer 1 · answered Aug 07 '22 at 15:12

Based on your configurations, the explanation for this problem may be that you read the same keys at each batch (the same a, b, c with different value of d), and where you have an upsert operation, hudi relace the old values by the new one. Try using insert instead of upsert or modify the hudi key depending on what you want to do.

score 0 · Answer 2 · answered May 12 '23 at 10:24

0

you should change "outputMode('complete')" to ".outputMode('append')"

answered May 12 '23 at 10:24

wangchuan

1

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 12 '23 at 15:32

Hudi data overrides every time on new batch of spark structure streaming

2 Answers2