I am currently doing a POC on Delta Lake, where I came across a framework called Apache Hudi. Below is the data I am trying to write using the Apache Spark framework.
private val INITIAL_ALBUM_DATA = Seq(
  Album(800, 810, "6 String Theory", Array("Lay it down", "Am I Wrong", "68"), dateToLong("2019-12-01")),
  Album(801, 811, "Hail to the Thief", Array("2+2=5", "Backdrifts"), dateToLong("2019-12-01")),
  Album(801, 811, "Hail to the Thief", Array("2+2=5", "Backdrifts", "Go to sleep"), dateToLong("2019-12-03"))
)
The case class:

case class Album(albumId: Long, trackId: Long, title: String, tracks: Array[String], updateDate: Long)
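(dateToLong is a small helper, not central to the question; it just turns a yyyy-MM-dd string into epoch milliseconds, something along these lines:)

import java.time.LocalDate
import java.time.ZoneOffset

// Parses a yyyy-MM-dd string and returns epoch millis at UTC midnight.
private def dateToLong(date: String): Long =
  LocalDate.parse(date).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli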
I want to upsert using albumId and trackId together as the record key. I attempted the initial insert with the code below (albumDf is the DataFrame created from INITIAL_ALBUM_DATA above):
albumDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "albumId, trackId")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .mode(SaveMode.Append)
  .save(s"$basePath/$tableName/")
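For completeness, the supporting values referenced above look like this (the literal values here are placeholders, not my real config):

val tableName  = "albums"              // placeholder table name
val basePath   = "file:///tmp/hudi"    // placeholder base path
val combineKey = "updateDate"          // precombine field: the row with the latest updateDate wins on upsert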
But it seems like it doesn't write with multiple keys. The error I get while running the above is:
... 5 more
Caused by: org.apache.hudi.exception.HoodieKeyException: recordKey value: "null" for field: "albumId, trackId" cannot be null or empty.
at org.apache.hudi.keygen.SimpleKeyGenerator.getKe
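From the stack trace it looks like the default SimpleKeyGenerator is being used, and as far as I understand it only supports a single record key field, so it treats the whole string "albumId, trackId" as one (non-existent) column and reads null. My untested guess is that a composite key needs the ComplexKeyGenerator configured explicitly, something like:

.option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY,
  "org.apache.hudi.keygen.ComplexKeyGenerator")
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "albumId,trackId") // no space after the comma

(If I understand correctly, ComplexKeyGenerator would then build record keys of the form albumId:800,trackId:810.)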
Has anyone tried this with multiple keys? With a single key, either trackId or albumId alone, it works like a charm, but with two keys it fails. I am currently using Hudi 0.5.3 with Scala 2.11 and Spark 2.4.x. I have also tried Hudi 0.5.2-incubating and 0.6.0.