I am reading https://medium.com/slalom-build/data-lakehouse-building-the-next-generation-of-data-lakes-using-apache-hudi-41550f62f5f

I cannot understand the following piece of code. It seems that the upsert CDC events are applied before the delete CDC events.

    # invoke hudi_write function for upserts
    if df_w_upserts and df_w_upserts.count() > 0:
        hudi_write(
            df=df_w_upserts,
            schema="schema_name",
            table="table_name",
            path=path,
            mode="append",
            hudi_options=hudi_options
        )

    # invoke hudi_write function for deletes
    if df_w_deletes and df_w_deletes.count() > 0:
        hudi_options_copy = copy.deepcopy(hudi_options)
        hudi_options_copy["hoodie.datasource.write.operation"] = "delete"
        hudi_options_copy["hoodie.bloom.index.update.partition.path"] = False

        hudi_write(
            df=df_w_deletes,
            schema="schema_name",
            table="table_name",
            path=path,
            mode="append",
            hudi_options=hudi_options_copy
        )

My question is: how can I preserve the order of the CDC delete and upsert events when applying them? What happens if a record is deleted and then inserted again within the same batch?
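To make the concern concrete, here is a minimal sketch in plain Python. It is not Hudi code: the event tuples, the sequence-number field, and all function names are hypothetical and only model a key/value table. It compares applying events in their original order with applying all upserts first and all deletes second, as the snippet above appears to do. The last helper keeps only the latest event per key before splitting, which might be one way around the problem, assuming every CDC event carries a reliable ordering field; I have not verified that this is how it should be done with Hudi.

```python
# Hypothetical CDC events: (sequence_number, key, op, value). Not a Hudi API.
events = [
    (1, "id-1", "upsert", "v1"),
    (2, "id-1", "delete", None),
    (3, "id-1", "upsert", "v2"),  # record re-inserted after the delete
    (4, "id-2", "upsert", "a"),
    (5, "id-2", "delete", None),
]


def by_sequence(evts):
    """Return the events sorted by their sequence number."""
    return sorted(evts, key=lambda e: e[0])


def apply_in_event_order(evts):
    """Apply every event one by one, in sequence order."""
    table = {}
    for _, key, op, value in by_sequence(evts):
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = value
    return table


def apply_upserts_then_deletes(evts):
    """Apply all upserts first, then all deletes (the two-step pattern)."""
    table = {}
    for _, key, op, value in by_sequence(evts):
        if op == "upsert":
            table[key] = value
    for _, key, op, _ in by_sequence(evts):
        if op == "delete":
            table.pop(key, None)
    return table


def latest_event_per_key(evts):
    """Keep only the event with the highest sequence number for each key."""
    latest = {}
    for event in by_sequence(evts):
        latest[event[1]] = event
    return list(latest.values())


expected = apply_in_event_order(events)
batched = apply_upserts_then_deletes(events)
reduced = apply_upserts_then_deletes(latest_event_per_key(events))

print("in event order:       ", expected)
print("upserts then deletes: ", batched)
print("latest per key first: ", reduced)
```

In this toy model the two-step pattern loses `id-1`, because the delete from step 2 is applied after the re-insert from step 3. Reducing the batch to the latest event per key first gives the same result as applying the events in order.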

Thanks in advance.

BAE