
Apache Hudi version 0.13.0, Spark version 3.3.2

I'm new to Hudi and MinIO and I'm trying to write a table from a local database to MinIO in Hudi format, using the overwrite save mode. The first run writes the table successfully, but every subsequent run of the script fails with the error below. With the same configuration, append mode works across multiple runs; only overwrite fails after the first write.

[error] org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from s3a://hudi/status_device_view/test7/.hoodie/metadata/.hoodie/hoodie.properties
[error]         at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:289)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:138)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
[error]         at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.enablePartitions(HoodieBackedTableMetadataWriter.java:202)
[error]         at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:177)
[error]         at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:104)
[error]         at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:79)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:341)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:330)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1133)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1169)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1198)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:162)
[error]         at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:204)
[error]         at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:363)
[error]         at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
[error]         at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
[error]         at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
[error]         at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
[error]         at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
[error]         at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[error]         at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
[error]         at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
[error]         at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
[error]         at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
[error]         at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
[error]         at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
[error]         at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
[error]         at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[error]         at minio$.main(minio.scala:94)
[error]         at minio.main(minio.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error] Caused by: java.io.FileNotFoundException: No such file or directory: s3a://hudi/status_device_view/test7/.hoodie/metadata/.hoodie/hoodie.properties.backup
[error]         at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3866)
[error]         at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3688)
[error]         at org.apache.hadoop.fs.s3a.S3AFileSystem.extractOrFetchSimpleFileStatus(S3AFileSystem.java:5401)
[error]         at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1465)
[error]         at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:1441)
[error]         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976)
[error]         at org.apache.hudi.common.fs.HoodieWrapperFileSystem.open(HoodieWrapperFileSystem.java:476)
[error]         at org.apache.hudi.common.table.HoodieTableConfig.fetchConfigs(HoodieTableConfig.java:343)
[error]         at org.apache.hudi.common.table.HoodieTableConfig.<init>(HoodieTableConfig.java:270)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.<init>(HoodieTableMetaClient.java:138)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.newMetaClient(HoodieTableMetaClient.java:689)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient.access$000(HoodieTableMetaClient.java:81)
[error]         at org.apache.hudi.common.table.HoodieTableMetaClient$Builder.build(HoodieTableMetaClient.java:770)
[error]         at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.enablePartitions(HoodieBackedTableMetadataWriter.java:202)
[error]         at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.<init>(HoodieBackedTableMetadataWriter.java:177)
[error]         at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.<init>(SparkHoodieBackedTableMetadataWriter.java:104)
[error]         at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:79)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:341)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:330)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1133)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1169)
[error]         at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1198)
[error]         at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:162)
[error]         at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:204)
[error]         at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:363)
[error]         at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
[error]         at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
[error]         at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
[error]         at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
[error]         at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
[error]         at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
[error]         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
[error]         at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
[error]         at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[error]         at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[error]         at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
[error]         at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
[error]         at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
[error]         at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
[error]         at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
[error]         at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
[error]         at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
[error]         at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
[error]         at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[error]         at minio$.main(minio.scala:94)
[error]         at minio.main(minio.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run last Compile / run for the full output
[error] (Compile / run) org.apache.hudi.exception.HoodieIOException: Could not load Hoodie properties from s3a://hudi/status_device_view/test7/.hoodie/metadata/.hoodie/hoodie.properties

The code used to write the table is:

    status_device_join.write.format("org.apache.hudi")
      .options(Config)
      .mode("overwrite")
      .save("s3a://hudi/status_device_view/test7")

And the configuration I used is:

    val Config = scala.collection.mutable.Map(
                  "className"->"org.apache.hudi",
                  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
                  "hoodie.datasource.write.recordkey.field" -> "serial_number",
                  "hoodie.datasource.write.partitionpath.field" -> "",
                  "hoodie.datasource.write.precombine.field" -> "mac_address",
                  "hoodie.table.name" -> "status_device_join",
                  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.NonPartitionedExtractor",
                  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
                  "hoodie.datasource.write.operation" -> "insert",
                  "hoodie.filesystem.view.remote.retry.enable" -> "true",
                  "hoodie.embed.timeline.server" -> "false",
                  "hoodie.metadata.enable" -> "true",
                  "hoodie.clustering.preserve.commit.metadata" -> "true",
                  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE"
                  )

This is the Spark session I created:

    val sc = SparkSession.builder().master("local").appName("minio").
      config("spark.driver.extraClassPath", "/Users/Downloads/postgresql-42.6.0.jar").
      config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      config("spark.sql.hive.convertMetastoreParquet", value = false).
      config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog").
      config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension").
      config("spark.hadoop.spark.sql.legacy.parquet.nanosAsLong", value = false).
      getOrCreate()

I expected the old table to be overwritten by the new write, but instead I get the error shown above. I'd appreciate any help with this.

Adi

1 Answer


In Hudi, the way to overwrite a table with the incoming dataframe is to keep SaveMode.Append and set hoodie.datasource.write.operation=insert_overwrite_table.
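
For example, adapting the write call from your question (reusing the same Config map and path), it would look roughly like this — the only changes are the extra operation option and the append mode:

    // Sketch based on the question's code: Append mode plus the
    // insert_overwrite_table operation replaces the table contents on each run.
    status_device_join.write.format("org.apache.hudi")
      .options(Config)
      .option("hoodie.datasource.write.operation", "insert_overwrite_table")
      .mode("append")
      .save("s3a://hudi/status_device_view/test7")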

If instead you only want to replace the partitions present in the incoming dataframe, use the insert_overwrite operation; see the sketch below.
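
A hypothetical sketch, assuming the table were partitioned on a column named date (your table is non-partitioned, so the key generator and partition path field would have to change as well):

    // Hypothetical: only applies to a partitioned table. The "date" column and the
    // SimpleKeyGenerator override are assumptions, not part of the original config.
    status_device_join.write.format("org.apache.hudi")
      .options(Config)
      .option("hoodie.datasource.write.partitionpath.field", "date")
      .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.SimpleKeyGenerator")
      .option("hoodie.datasource.write.operation", "insert_overwrite")
      .mode("append")
      .save("s3a://hudi/status_device_view/test7")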

parisni