Problem Description
We have a MOR table partitioned by yearmonth(yyyyMM). We would like to trigger async clustering after running compaction at the end of the day, so that small files get stitched together into larger files. Async clustering for the table is failing with `object not serializable (class: org.apache.avro.generic.GenericData$Record)`. Below are the different approaches I tried and the error messages I got.
Hudi Config Used

```scala
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> writeOperation,
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false"
"hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
"hoodie.compact.inline" -> "false",
"hoodie.compact.schedule.inline" -> "true",
"hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits" -> "5",
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" -> "3",
"hoodie.clustering.async.enabled" -> "true",
"hoodie.clustering.async.max.commits" -> "2",
"hoodie.clustering.execution.strategy.class" -> "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
"hoodie.clustering.plan.strategy.sort.columns" -> recordKey,
"hoodie.clustering.plan.strategy.small.file.limit" -> "67108864",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "134217728",
"hoodie.clustering.plan.strategy.max.bytes.per.group" -> "2147483648",
"hoodie.clustering.plan.strategy.max.num.groups" -> "150",
"hoodie.clustering.preserve.commit.metadata" -> "true"
Approaches Tried
Approach 1: Triggered a standalone clustering job (HoodieClusteringJob) with running mode scheduleAndExecute.

Code Used

```scala
import java.util

import org.apache.hudi.utilities.HoodieClusteringJob

// Build the standalone clustering job config (HoodieClusteringJob.Config).
val hudiClusterConfig = new HoodieClusteringJob.Config()
hudiClusterConfig.basePath = "<table-path>"
hudiClusterConfig.tableName = "<table-name>"
hudiClusterConfig.runningMode = "scheduleAndExecute"
hudiClusterConfig.retryLastFailedClusteringJob = true
val configList: util.List[String] = new util.ArrayList()
configList.add("hoodie.clustering.async.enabled=true")
configList.add("hoodie.clustering.async.max.commits=2") configList.add("hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
configList.add("hoodie.clustering.plan.strategy.sort.columns=<sort-columns>")
configList.add("hoodie.clustering.plan.strategy.small.file.limit=67108864")
configList.add("hoodie.clustering.plan.strategy.target.file.max.bytes=134217728")
configList.add("hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648")
configList.add("hoodie.clustering.plan.strategy.max.num.groups=150")
configList.add("hoodie.clustering.preserve.commit.metadata=true")
hudiClusterConfig.configs = configList
val hudiClusterJob = new HoodieClusteringJob(jsc, hudiClusterConfig)
// cluster(retry) schedules and executes the clustering plan, allowing `retry` re-attempts.
val clusterStatus = hudiClusterJob.cluster(1)
println(clusterStatus)
```
Stacktrace
```
ShuffleMapStage 87 (sortBy at RDDCustomColumnsSortPartitioner.java:64) failed in 1.098 s due to Job aborted due to stage failure: task 0.0 in stage 28.0 (TID 367) had a not serializable result: org.apache.avro.generic.GenericData$Record Serialization stack:
- object not serializable (class: org.apache.avro.generic.GenericData$Record, value:
```
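One detail that may be relevant: `GenericData$Record` does not implement `java.io.Serializable`, so any shuffle that ships these records through Spark's default `JavaSerializer` fails exactly like this. Hudi's setup docs configure Kryo serialization on the Spark context; a minimal sketch of creating the `jsc` passed to `HoodieClusteringJob` with Kryo enabled (the app name is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

// GenericData$Record is not java.io.Serializable, so shuffled records need Kryo.
// Hudi's docs set spark.serializer to KryoSerializer for exactly this reason.
val conf = new SparkConf()
  .setAppName("hudi-async-clustering") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val jsc = new JavaSparkContext(conf)
```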
Approach 2: Used the SQL procedure run_clustering to schedule and execute clustering. We found that the replacecommit created through the procedure contained less data than the one created when scheduled from the code in approach 1.

Code Used

```python
query_run_clustering = f"call run_clustering(path => '{path}')"
spark_df_run_clustering = spark.sql(query_run_clustering)
spark_df_run_clustering.show()
```
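For reference, the procedure can also be invoked by table name instead of path; a sketch in Scala for consistency with approach 1, assuming the table is Hive-synced as `<database>.<table>` (the stacktrace below is from the path-based call above):

```scala
// Sketch: call the clustering procedure by catalog table name instead of path.
// <database> and <table> are placeholders for the Hive-synced identifiers.
spark.sql("call run_clustering(table => '<database>.<table>')").show()
```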
Stacktrace
```
An error occurred while calling o97.sql. : org.apache.hudi.exception.HoodieClusteringException: Clustering failed to write to files:c94cb139-70cf-4195-ad87-c56527ab5ccf-0,bc2c65f1-39fc-4879-ba83-5003fc9757b0-0,7e699100-39a3-46f7-ac7d-42e9cfaad2e1-0,a6076357-8a7f-4ae1-b6ec-2dd509d9818e-0,9a6752a4-1bcb-4dfb-ad82-80877d07cbdc-0,e5573f8c-c5bc-45b4-a670-1bcd9257726d-0,b00372f1-bd6d-4e46-9add-0ceca84f005a-0,6eb6bc42-b086-4aa0-a899-0b0ff602b7bf-0,35a06cda-57df-457f-aa8c-4792fd52cf33-0,78c75d85-ab08-4e97-9127-6b350d07e8f8-0,18ed0a15-9d42-495b-a43c-140b08dbc852-0,e2f5f9da-0717-4b8e-95b3-09639f2fc4a9-0,700a07e2-2114-4d50-9673-0e3dc885da55-0,1836db85-1320-4ff8-8aea-fc5dbbe267c7-0,b6c0eb8a-fd1e-40e6-bc8c-3e3b6180d916-0,225b791e-ac7b-4a6d-a295-e547c3e6a558-0,e567f6fb-bf27-496a-9c67-d26a5824870e-0,7a40f1c3-c3f5-433f-9cb8-5773de8d9557-0,b4f336b9-6669-4510-a2eb-c300fdae2320-0,1f4ef584-c199-449a-ba82-19b79531432e-0,b3b06f51-32e5-4a94-9ffe-035c08ae7f50-0,debcc1fc-8a67-4a0b-8691-d28b96c0403a-0,c40a0b32-8394-4c0c-8d41-a58e247e44c9-0,942b69c8-a292-4ba6-86a6-9c3e344a9cd6-0,80f06951-1497-4cca-861e-22addd451ddb-0,2eb68890-154a-4963-90fd-47a1a32dceaf-0,5f05cffc-7a4b-4817-8e3e-14905fd81b9b-0,1acba9bf-1ef8-40e8-8a1d-7a54ebc6387e-0,008fd3cc-987b-4855-8125-b5d0529a26a1-0,dfaf9d4c-f23e-49d4-98df-078622fb9383-0 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381)
```
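To check whether the failed clustering plans are still pending on the timeline before retrying them, something like the sketch below can be used (`<table-path>` is a placeholder; `jsc` is the JavaSparkContext from approach 1):

```scala
import org.apache.hudi.common.table.HoodieTableMetaClient

// Sketch: list pending replacecommit (clustering) instants on the active timeline,
// to decide whether to retry them or roll them back.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(jsc.hadoopConfiguration())
  .setBasePath("<table-path>")
  .build()
metaClient.getActiveTimeline
  .filterPendingReplaceTimeline()
  .getInstants
  .forEach(instant => println(instant))
```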
Would appreciate it if anybody could provide some suggestions.
Expected behavior
Clustering should stitch the smaller files together into larger files.
Environment Description
* Platform : AWS Glue v4.0
* Hudi version : 0.12.1
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no