
Here I am trying to simulate updates and deletes over a Hudi dataset and want to see the resulting state reflected in an Athena table. We use the EMR, S3 and Athena services of AWS.

  1. Attempting a record update with a withdrawal object
withdrawalID_mutate = 10382495

updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))

updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath)

# read back separately; .show() returns None, so don't assign its result
hudiDF = spark.read \
    .format("hudi") \
    .load(tablePath) \
    .filter(col("withdrawalID") == withdrawalID_mutate)
hudiDF.show()

Reading back through Spark shows the updated record, but in the Athena table it shows up as an appended (duplicate) row rather than an in-place update. Probably something to do with the Glue Catalog?
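For Hudi to reconcile a new write with the existing record instead of surfacing a duplicate, the record key, a precombine field, and the Hive/Glue sync settings all need to be set consistently. Below is a minimal sketch of a combined option dict; the field names `withdrawalID` and `updatedAt`, and the literal `tableName`/`partitionColumn` values, are placeholders I've assumed, not taken from the question.

```python
# Placeholder values -- substitute your real table name and partition column.
tableName = "withdrawals"
partitionColumn = "transactionDate"

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'withdrawalID',
    'hoodie.datasource.write.partitionpath.field': partitionColumn,
    # Without a precombine field Hudi cannot decide which version of a
    # record with the same key wins; missing it is a common cause of
    # apparent duplicates downstream.
    'hoodie.datasource.write.precombine.field': 'updatedAt',
    # Hive sync keeps the Glue Catalog (and therefore Athena) in step
    # with the latest Hudi commit.
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': tableName,
    'hoodie.datasource.hive_sync.partition_fields': partitionColumn,
}
```

This is only a sketch of the shape of the configuration; the exact sync options depend on the Hudi version bundled with your EMR release.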

  2. Attempting a record delete
deleteDF = updateDF  # deleting the updated record from above
    
# apply **hudi_options first so the explicit delete-related options below
# override any colliding keys (later option calls win)
deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .mode("append") \
    .save(tablePath)

This still shows the deleted record in the Athena table.
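Besides the `EmptyHoodieRecordPayload` approach, Hudi also has a dedicated `delete` write operation. One subtlety worth checking: in Spark's `DataFrameWriter`, a later `.option(...)` / `.options(...)` call overwrites an earlier one for the same key, so if `hudi_options` is merged after the per-call options it can silently reset the operation. A sketch of building the option map explicitly, with the override applied last (the `base_options` values are placeholders standing in for the `hudi_options` from the question):

```python
# Stand-in for the hudi_options dict in the question (placeholder values).
base_options = {
    'hoodie.table.name': 'withdrawals',
    'hoodie.datasource.write.operation': 'upsert',
}

# Merge the delete override *after* the base options so it wins.
delete_options = {
    **base_options,
    'hoodie.datasource.write.operation': 'delete',
}

# Then, with a Spark session available:
# deleteDF.write.format("hudi") \
#     .options(**delete_options) \
#     .mode("append") \
#     .save(tablePath)
```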

I also tried mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest one.
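If the intent behind mode("overwrite") was to replace only the partitions touched by the incoming batch, note that newer Hudi releases (0.7.0 and later; whether your EMR build includes it is an assumption) expose an `insert_overwrite` write operation that overwrites just those partitions while leaving the rest of the table intact. A sketch of the option (table name is a placeholder):

```python
overwrite_options = {
    'hoodie.table.name': 'withdrawals',  # placeholder
    # Replaces only the partitions present in the incoming DataFrame,
    # unlike mode("overwrite"), which rewrites the whole table path.
    'hoodie.datasource.write.operation': 'insert_overwrite',
}
```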

Has anyone faced the same issue and can guide me in the right direction?

  • Could you provide the hudi_options details? – Felix K Jose Aug 19 '21 at 19:38
  • `hudiOptions01 = { 'hoodie.table.name': tableName, 'hoodie.datasource.write.operation': 'upsert', 'hoodie.datasource.write.table.type': 'COPY_ON_WRITE', 'hoodie.datasource.write.recordkey.field': primaryKeyColumn, 'hoodie.datasource.write.partitionpath.field': partitionColumn, }` – jishmisc28 Aug 20 '21 at 12:53
  • `hudiOptions02 = { 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.datasource.hive_sync.table': tableName, 'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://localhost:10000', 'hoodie.datasource.hive_sync.assume_date_partitioning': 'true', 'hoodie.datasource.hive_sync.partition_fields': partitionColumn, }` – jishmisc28 Aug 20 '21 at 12:54
  • Since the comment box does not let me post the complete JSON at once, I split it into the two parts above. – jishmisc28 Aug 20 '21 at 12:55
  • Could you also share two sample records? Are you not providing ``"hoodie.datasource.write.precombine.field"``? – Felix K Jose Aug 20 '21 at 19:01
  • Also what is your `partitionColumn` value? – dacort Sep 23 '21 at 18:49
  • I'm facing this same issue, did your issue get resolved? If yes, can you please share how you resolved it? – user1119283 May 17 '22 at 08:55

0 Answers