
I'm running an Apache Hudi application on Apache Spark. When I submit the application in client mode it works fine, but when I submit it in cluster mode it fails with the error below.
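Only the deploy mode differs between the two submissions (a sketch, assuming YARN as on EMR; the script name is a placeholder):

spark-submit --master yarn --deploy-mode client my_hudi_job.py    # works
spark-submit --master yarn --deploy-mode cluster my_hudi_job.py   # fails with the error below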

py4j.protocol.Py4JJavaError: An error occurred while calling o196.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/
    at org.apache.hudi.hive.HoodieHiveClient.createHiveConnection(HoodieHiveClient.java:422)
    at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:95)
    at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:66)
    at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:321)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:363)
    at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:359)
    at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
    at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:359)
    at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:417)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:205)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:173)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:169)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:197)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:194)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:112)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
    at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$executeQuery$1(SQLExecution.scala:83)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1$$anonfun$apply$1.apply(SQLExecution.scala:94)
    at org.apache.spark.sql.execution.QueryExecutionMetrics$.withMetrics(QueryExecutionMetrics.scala:141)
    at org.apache.spark.sql.execution.SQLExecution$.org$apache$spark$sql$execution$SQLExecution$$withMetrics(SQLExecution.scala:178)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:93)
– Pradeep Saini

2 Answers


After modifying the Hudi config "hoodie.datasource.hive_sync.jdbcurl" it started working. In cluster mode the driver runs on a cluster node rather than on the machine you submit from, so the default jdbc:hive2://localhost:10000/ no longer points at the host running HiveServer2.
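A minimal sketch of the change (the hostname, bucket, and table details are placeholders):

hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.hive_sync.enable": "true",
    # Point the sync at a HiveServer2 host the driver can reach,
    # e.g. the EMR master node, instead of the default localhost:
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://<master-node-dns>:10000/",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("s3://my-bucket/hudi/my_table")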

– Pradeep Saini
  • hoodie.datasource.hive_sync.jdbcurl should be given the master DNS hostname when working with EMR and using spark-submit in cluster mode. @PradeepSaini – Vikas Chitturi Oct 07 '21 at 02:49
  • @VikasChitturi tried to use master DNS hostname - the same issue. – dytyniak Oct 07 '21 at 16:41
  • 1
    Ok, can you share the exception from yarn logs? master DNS name should look like this: jdbc:hive2://ip-XX-XXX-XX-XX.ec2.internal:10000 @dytyniak – Vikas Chitturi Oct 10 '21 at 16:37
  • @VikasChitturi An error occurred while calling o107.save. : org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://ip-172-31-35-94.ec2.internal:10000/ – dytyniak Oct 11 '21 at 11:23
  • Maybe check your cluster configuration? Is it operating in a subnet? Could there be firewall issues? Not sure, because it worked for me; my EMR has proper security groups enabled for the master and worker instances. – Vikas Chitturi Oct 11 '21 at 13:25
  • @dytyniak please check my configuration in the answer below. – Vikas Chitturi Oct 11 '21 at 13:35
  • @VikasChitturi I am using the default config; yes, it is in a subnet. – dytyniak Oct 11 '21 at 15:13
  • @dytyniak you don't have to use JDBC: set 'hoodie.datasource.hive_sync.use_jdbc': "false" and 'hoodie.datasource.hive_sync.mode': "hms"; with these two configs you don't have to pass a JDBC URL (see the sketch below). – Vikas Chitturi Dec 01 '21 at 07:37
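A minimal sketch of the JDBC-free sync that comment suggests (only the two sync options are shown; they would be merged into a full set of write options like the one in the answer below):

hms_sync_options = {
    # Sync to the Hive metastore directly instead of over JDBC,
    # so no HiveServer2 URL is needed:
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
}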

Following are the Hudi write options I am using, which worked provided the EMR cluster is configured correctly with proper security groups and subnet settings:

hudi_write_table_options = {
    "hoodie.table.name": "hudi_data_test",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # Older alias of table.type, kept for compatibility with earlier Hudi releases:
    "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
    # Field options take comma-separated strings, not Python lists:
    "hoodie.datasource.write.recordkey.field": "a,b",
    "hoodie.datasource.write.partitionpath.field": "a,b",
    "hoodie.datasource.write.precombine.field": "c",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.consistency.check.enabled": "true",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.auto_create_database": "true",
    "hoodie.datasource.hive_sync.database": "hudidatabase",
    "hoodie.datasource.hive_sync.table": "hudi_data_test",
    "hoodie.datasource.hive_sync.partition_fields": "a,b",
    # HiveServer2 on the EMR master node (the address is redacted here):
    "hoodie.datasource.hive_sync.jdbcurl": "jdbc:hive2://ip-XXX-XX-XX-XX.ec2.internal:10000/",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}
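
These options can then be passed straight to the DataFrame writer (a sketch; df and the target path are placeholders):

df.write.format("hudi") \
    .options(**hudi_write_table_options) \
    .mode("append") \
    .save("s3://my-bucket/hudi/hudi_data_test")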