I am currently doing a POC on Delta Lake, where I came across a framework called Apache Hudi. Below is the data I am trying to write using the Apache Spark framework.
private val INITIAL_ALBUM_DATA = Seq(
  Album(800, 810, "6 String Theory", Array("Lay it down", "Am I Wrong", "68"), dateToLong("2019-12-01")),
  Album(801, 811, "Hail to the Thief", Array("2+2=5", "Backdrifts"), dateToLong("2019-12-01")),
  Album(801, 811, "Hail to the Thief", Array("2+2=5", "Backdrifts", "Go to sleep"), dateToLong("2019-12-03"))
)
The case class:

case class Album(albumId: Long, trackId: Long, title: String, tracks: Array[String], updateDate: Long)
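(dateToLong is a small helper, not central to the question; it just turns a yyyy-MM-dd string into epoch milliseconds, something along these lines:)

import java.time.LocalDate
import java.time.ZoneOffset

// Parses a yyyy-MM-dd string and returns epoch millis at UTC midnight.
private def dateToLong(date: String): Long =
  LocalDate.parse(date).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli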
I want to upsert using albumId and trackId together as the record key. I attempted the initial insert with the code below (albumDf is the DataFrame created from INITIAL_ALBUM_DATA above):
albumDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "albumId, trackId")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .mode(SaveMode.Append)
  .save(s"$basePath/$tableName/")
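For completeness, the supporting values referenced above look like this (the literal values here are placeholders, not my real config):

val tableName  = "albums"              // placeholder table name
val basePath   = "file:///tmp/hudi"    // placeholder base path
val combineKey = "updateDate"          // precombine field: the row with the latest updateDate wins on upsert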
But it seems like it doesn't write with multiple keys. The error I get while running the above is:
... 5 more
Caused by: org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "albumId, trackId" cannot be null or empty.
at org.apache.hudi.keygen.SimpleKeyGenerator.getKe
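From the stack trace it looks like the default SimpleKeyGenerator is being used, and as far as I understand it only supports a single record key field, so it treats the whole string "albumId, trackId" as one (non-existent) column and reads null. My untested guess is that a composite key needs the ComplexKeyGenerator configured explicitly, something like:

.option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
  "org.apache.hudi.keygen.ComplexKeyGenerator")
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "albumId,trackId") // no space after the comma

(If I understand correctly, ComplexKeyGenerator would then build record keys of the form albumId:800,trackId:810.)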
Has anyone tried this with multiple keys? With a single key, either trackId or albumId alone, it works like a charm, but with two keys it fails. I am currently using Hudi 0.5.3 with Scala 2.11 and Spark 2.4.x. I have also tried Hudi 0.5.2-incubating and 0.6.0.