
I am trying to create a simple Hudi table with the MERGE_ON_READ table type. After executing the code, the hoodie.properties file still shows hoodie.table.type=COPY_ON_WRITE.

Am I missing something here?

Jupyter Notebook for this code: https://github.com/sannidhiteredesai/spark/blob/master/hudi_acct.ipynb

hudi_options = {
    "hoodie.table.name": "hudi_acct",
    "hoodie.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "acctid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 8,
    "hoodie.insert.shuffle.parallelism": 8,
}

input_df = spark.createDataFrame(
    [
        (100, "2015-01-01", "2015-01-01T13:51:39.340396Z", 10),
        (101, "2015-01-01", "2015-01-01T12:14:58.597216Z", 10),
        (102, "2015-01-01", "2015-01-01T13:51:40.417052Z", 10),
        (103, "2015-01-01", "2015-01-01T13:51:40.519832Z", 10),
        (104, "2015-01-02", "2015-01-01T12:15:00.512679Z", 10),
        (104, "2015-01-02", "2015-01-01T12:15:00.512679Z", 10),
        (104, "2015-01-02", "2015-01-02T12:15:00.512679Z", 20),
        (105, "2015-01-02", "2015-01-01T13:51:42.248818Z", 10),
    ],
    ("acctid", "date", "ts", "deposit"),
)

# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_dataset)
)


update_df = spark.createDataFrame(
    [(100, "2015-01-01", "2015-01-01T13:51:39.340396Z", 20)],
    ("acctid", "date", "ts", "deposit"),
)

# UPDATE
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save(hudi_dataset)
)

Edit: After executing the code above I see 2 parquet files created in the date=2015-01-01 partition. When reading the 2nd parquet file I expected to get only the 1 updated record, but I can see all the other records in that partition as well.
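
For reference, the two views can be compared roughly like this (a sketch; <newest-file>.parquet stands in for the newest parquet file in the partition):

# Snapshot read through Hudi returns the deduplicated, latest view of the table
snapshot_df = spark.read.format("org.apache.hudi").load(hudi_dataset)
snapshot_df.filter("date = '2015-01-01'").show()

# Reading the newest parquet file directly bypasses Hudi's commit metadata,
# so it shows every record in the rewritten file slice, not just the update
raw_df = spark.read.parquet(hudi_dataset + "/date=2015-01-01/<newest-file>.parquet")
raw_df.show()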

sannidhi

2 Answers


The issue is with the "hoodie.table.type": "MERGE_ON_READ" configuration. You have to use hoodie.datasource.write.table.type instead. If you update the configuration as follows it will work; I have tested it.

hudi_options = {
    "hoodie.table.name": "hudi_acct",
    "hoodie.datasource.write.table.type": "MERGE_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "acctid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 8,
    "hoodie.insert.shuffle.parallelism": 8,
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": 10
}
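
Note that hoodie.properties is written when the table is first initialized, so if a COPY_ON_WRITE table already exists at the target path you may need to delete the base path (or write to a fresh one) for the corrected option to take effect. A quick way to confirm the type afterwards, assuming hudi_dataset is the table base path from the question:

# Read hoodie.properties back and check the persisted table type
props = (
    spark.sparkContext
    .wholeTextFiles(hudi_dataset + "/.hoodie/hoodie.properties")
    .values()
    .first()
)
print(props)  # expect a line: hoodie.table.type=MERGE_ON_READ
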
Felix K Jose

Would you please try mode("overwrite") for the initial insert when loading data into Hudi first and see if it works?
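
A minimal sketch of that suggestion, assuming the same hudi_options and hudi_dataset as in the question:

# Initial load with overwrite instead of append; subsequent upserts keep append
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save(hudi_dataset)
)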

sf lee
  • I tried with overwrite mode and it still shows COPY_ON_WRITE. Is it something related to my input data size? Because the input data as well as the updated data is very small, does Hudi use COPY_ON_WRITE by default? – sannidhi Jul 11 '21 at 12:30
  • https://github.com/sannidhiteredesai/spark/blob/master/hudi_acct.ipynb – sannidhi Jul 11 '21 at 12:52
  • No, it should not be related to your data size. Is the table name correct in hoodie.properties? – sf lee Jul 12 '21 at 14:57
  • Yes, the table name is correct. This is what hoodie.properties contains:
    hoodie.table.precombine.field=ts
    hoodie.table.name=hudi_acct
    hoodie.archivelog.folder=archived
    hoodie.table.type=COPY_ON_WRITE
    hoodie.table.version=1
    hoodie.timeline.layout.version=1
    – sannidhi Jul 12 '21 at 18:54