Problem Description
We have a MOR table partitioned by yearmonth(yyyyMM). We would like to trigger async clustering after running compaction at the end of the day, so that small files get stitched together into larger files. Async clustering for the table is failing with `object not serializable (class: org.apache.avro.generic.GenericData$Record)`. Below are the different approaches I tried and the error messages I got.
Hudi Config Used

```scala
"hoodie.table.name" -> hudiTableName,
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
"hoodie.datasource.write.precombine.field" -> preCombineKey,
"hoodie.datasource.write.recordkey.field" -> recordKey,
"hoodie.datasource.write.operation" -> writeOperation,
"hoodie.datasource.write.row.writer.enable" -> "true",
"hoodie.datasource.write.reconcile.schema" -> "true",
"hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
"hoodie.datasource.write.hive_style_partitioning" -> "true",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.datasource.hive_sync.enable" -> "true",
"hoodie.datasource.hive_sync.table" -> hudiTableName,
"hoodie.datasource.hive_sync.database" -> databaseName,
"hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
"hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
"hoodie.datasource.hive_sync.use_jdbc" -> "false",
"hoodie.combine.before.upsert" -> "true",
"hoodie.index.type" -> "BLOOM",
"spark.hadoop.parquet.avro.write-old-list-structure" -> "false"
"hoodie.datasource.write.table.type" -> "MERGE_ON_READ"
"hoodie.compact.inline" -> "false",
"hoodie.compact.schedule.inline" -> "true",
"hoodie.compact.inline.trigger.strategy" -> "NUM_COMMITS",
"hoodie.compact.inline.max.delta.commits" -> "5",
"hoodie.cleaner.policy" -> "KEEP_LATEST_COMMITS",
"hoodie.cleaner.commits.retained" -> "3",
"hoodie.clustering.async.enabled" -> "true",
"hoodie.clustering.async.max.commits" -> "2",
"hoodie.clustering.execution.strategy.class" -> "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
"hoodie.clustering.plan.strategy.sort.columns" -> recordKey,
"hoodie.clustering.plan.strategy.small.file.limit" -> "67108864",
"hoodie.clustering.plan.strategy.target.file.max.bytes" -> "134217728",
"hoodie.clustering.plan.strategy.max.bytes.per.group" -> "2147483648",
"hoodie.clustering.plan.strategy.max.num.groups" -> "150",
"hoodie.clustering.preserve.commit.metadata" -> "true"
Approaches Tried
Approach 1: Triggered a standalone clustering job (HoodieClusteringJob) with running mode scheduleAndExecute.

Code Used

```scala
import java.util

import org.apache.hudi.utilities.HoodieClusteringJob

// Build the standalone clustering job config (HoodieClusteringJob.Config).
val hudiClusterConfig = new HoodieClusteringJob.Config()
hudiClusterConfig.basePath = "<table-path>"
hudiClusterConfig.tableName = "<table-name>"
hudiClusterConfig.runningMode = "scheduleAndExecute"
hudiClusterConfig.retryLastFailedClusteringJob = true
val configList: util.List[String] = new util.ArrayList()
configList.add("hoodie.clustering.async.enabled=true")
configList.add("hoodie.clustering.async.max.commits=2") configList.add("hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy")
configList.add("hoodie.clustering.plan.strategy.sort.columns=<sort-columns>")
configList.add("hoodie.clustering.plan.strategy.small.file.limit=67108864")
configList.add("hoodie.clustering.plan.strategy.target.file.max.bytes=134217728")
configList.add("hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648")
configList.add("hoodie.clustering.plan.strategy.max.num.groups=150")
configList.add("hoodie.clustering.preserve.commit.metadata=true")
hudiClusterConfig.configs = configList
val hudiClusterJob = new HoodieClusteringJob(jsc, hudiClusterConfig)
// cluster(retry) schedules and executes the clustering plan, allowing `retry` re-attempts.
val clusterStatus = hudiClusterJob.cluster(1)
println(clusterStatus)
```
Stacktrace
```
ShuffleMapStage 87 (sortBy at RDDCustomColumnsSortPartitioner.java:64) failed in 1.098 s due to Job aborted due to stage failure: task 0.0 in stage 28.0 (TID 367) had a not serializable result: org.apache.avro.generic.GenericData$Record Serialization stack:
- object not serializable (class: org.apache.avro.generic.GenericData$Record, value:
```
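One detail that may be relevant: `GenericData$Record` does not implement `java.io.Serializable`, so any shuffle that ships these records through Spark's default `JavaSerializer` fails exactly like this. Hudi's setup docs configure Kryo serialization on the Spark context; a minimal sketch of creating the `jsc` passed to `HoodieClusteringJob` with Kryo enabled (the app name is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

// GenericData$Record is not java.io.Serializable, so shuffled records need Kryo.
// Hudi's docs set spark.serializer to KryoSerializer for exactly this reason.
val conf = new SparkConf()
  .setAppName("hudi-async-clustering") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val jsc = new JavaSparkContext(conf)
```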
Approach 2: Used the SQL procedure run_clustering to schedule and execute clustering. We found that the replacecommit created through the procedure contained less data than the one created when scheduled from the code in approach 1.

Code Used

```python
query_run_clustering = f"call run_clustering(path => '{path}')"
spark_df_run_clustering = spark.sql(query_run_clustering)
spark_df_run_clustering.show()
```
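For reference, the procedure can also be invoked by table name instead of path; a sketch in Scala for consistency with approach 1, assuming the table is Hive-synced as `<database>.<table>` (the stacktrace below is from the path-based call above):

```scala
// Sketch: call the clustering procedure by catalog table name instead of path.
// <database> and <table> are placeholders for the Hive-synced identifiers.
spark.sql("call run_clustering(table => '<database>.<table>')").show()
```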
Stacktrace
```
An error occurred while calling o97.sql. : org.apache.hudi.exception.HoodieClusteringException: Clustering failed to write to files:c94cb139-70cf-4195-ad87-c56527ab5ccf-0,bc2c65f1-39fc-4879-ba83-5003fc9757b0-0,7e699100-39a3-46f7-ac7d-42e9cfaad2e1-0,a6076357-8a7f-4ae1-b6ec-2dd509d9818e-0,9a6752a4-1bcb-4dfb-ad82-80877d07cbdc-0,e5573f8c-c5bc-45b4-a670-1bcd9257726d-0,b00372f1-bd6d-4e46-9add-0ceca84f005a-0,6eb6bc42-b086-4aa0-a899-0b0ff602b7bf-0,35a06cda-57df-457f-aa8c-4792fd52cf33-0,78c75d85-ab08-4e97-9127-6b350d07e8f8-0,18ed0a15-9d42-495b-a43c-140b08dbc852-0,e2f5f9da-0717-4b8e-95b3-09639f2fc4a9-0,700a07e2-2114-4d50-9673-0e3dc885da55-0,1836db85-1320-4ff8-8aea-fc5dbbe267c7-0,b6c0eb8a-fd1e-40e6-bc8c-3e3b6180d916-0,225b791e-ac7b-4a6d-a295-e547c3e6a558-0,e567f6fb-bf27-496a-9c67-d26a5824870e-0,7a40f1c3-c3f5-433f-9cb8-5773de8d9557-0,b4f336b9-6669-4510-a2eb-c300fdae2320-0,1f4ef584-c199-449a-ba82-19b79531432e-0,b3b06f51-32e5-4a94-9ffe-035c08ae7f50-0,debcc1fc-8a67-4a0b-8691-d28b96c0403a-0,c40a0b32-8394-4c0c-8d41-a58e247e44c9-0,942b69c8-a292-4ba6-86a6-9c3e344a9cd6-0,80f06951-1497-4cca-861e-22addd451ddb-0,2eb68890-154a-4963-90fd-47a1a32dceaf-0,5f05cffc-7a4b-4817-8e3e-14905fd81b9b-0,1acba9bf-1ef8-40e8-8a1d-7a54ebc6387e-0,008fd3cc-987b-4855-8125-b5d0529a26a1-0,dfaf9d4c-f23e-49d4-98df-078622fb9383-0 at org.apache.hudi.client.SparkRDDWriteClient.completeClustering(SparkRDDWriteClient.java:381)
```
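To check whether the failed clustering plans are still pending on the timeline before retrying them, something like the sketch below can be used (`<table-path>` is a placeholder; `jsc` is the JavaSparkContext from approach 1):

```scala
import org.apache.hudi.common.table.HoodieTableMetaClient

// Sketch: list pending replacecommit (clustering) instants on the active timeline,
// to decide whether to retry them or roll them back.
val metaClient = HoodieTableMetaClient.builder()
  .setConf(jsc.hadoopConfiguration())
  .setBasePath("<table-path>")
  .build()
metaClient.getActiveTimeline
  .filterPendingReplaceTimeline()
  .getInstants
  .forEach(instant => println(instant))
```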
Would appreciate it if anybody could provide some suggestions.
Expected behavior
Clustering should stitch the smaller files together into larger files.
Environment Description
* Platform : AWS Glue v4.0
* Hudi version : 0.12.1
* Spark version : 3.3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no