I need to create a Hudi table from a PySpark DataFrame in an Azure Databricks notebook and save it to Azure Data Lake Storage Gen2. Here's my approach:
# Configure the serializer and the Hudi SQL extension
spark.sparkContext.setSystemProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
spark.sparkContext.setSystemProperty("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")

tableName = "my_hudi_table"
basePath = "<<path_to_save_data>>"  # my ADLS Gen2 path, redacted

# Generate ten sample records with Hudi's QuickstartUtils through the JVM gateway
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}
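Up to this point everything works: the DataFrame is created and can be inspected as usual (a quick check on my side; output omitted here):

# Sanity check on the generated sample data (ten records expected)
df.printSchema()
df.show(truncate=False)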
However, when I try to save it as a Hudi table, I get the following error:
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
org.apache.hudi.exception.HoodieException: hoodie only support org.apache.spark.serializer.KryoSerializer as spark.serializer
Yet if I check which serializer the cluster is actually using:
print(spark.sparkContext.getConf().get("spark.serializer"))
org.apache.spark.serializer.KryoSerializer
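For completeness, the whole SparkConf can be scanned the same way (a broader variant of the single-key check above; I'm omitting the output):

# List every conf entry whose key mentions "serializer"
for key, value in spark.sparkContext.getConf().getAll():
    if "serializer" in key.lower():
        print(key, "=", value)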
Note that I haven't run:
spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
because doing so raises:
AnalysisException: Cannot modify the value of a Spark config: spark.serializer
The cluster I'm working on runs Databricks Runtime 9.1 LTS (which includes Apache Spark 3.1.2 and Scala 2.12), and I have installed the JAR "hudi_spark3_1_2_bundle_2_12_0_10_1.jar" on it. I don't think the problem is the Data Lake itself, since I can save the same data in other formats (Parquet, CSV).
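For example, a plain Parquet write of the same DataFrame to the same storage account completes without errors (the "_parquet" suffix is just illustrative):

# The same DataFrame written as Parquet succeeds, so the storage path is reachable
df.write.format("parquet").mode("overwrite").save(basePath + "_parquet")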
Am I missing something? Any help would be appreciated. Thanks in advance.