Access multiple hadoop namenodes from Spark on YARN?

Question

I am working with a system in which Spark jobs run on YARN with one default Hadoop namenode. Recently, I added a second Hadoop namenode to the system. I now want the Spark jobs to read the input data from the default namenode and write the output to the second one. How can I configure this, or specify the paths in the Spark jobs? I tried putting the full HDFS paths in the code, for example:

 val spark = SparkSession.builder().appName("Example").getOrCreate();
 val input = spark.read.parquet("hdfs://defaultnamenode:9000/sample.parquet")
 input.write.parquet("hdfs://secondnamenode:9000/sample")

But it threw the exception:

17/12/18 10:49:42 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: path hdfs://secondnamenode:9000/sample already exists.;
org.apache.spark.sql.AnalysisException: path hdfs://secondnamenode:9000/sample already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:106)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:509)

And there is no result at all :(
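
For reference, the message in the trace ("path ... already exists") is what `DataFrameWriter` reports under its default save mode, `ErrorIfExists`, when the target directory is already present. By itself it does not show that the second namenode is unreachable. The following is a minimal sketch, assuming the existing contents of `hdfs://secondnamenode:9000/sample` may be replaced; the object name `CrossNamenodeWrite` is hypothetical, and the host names and ports are the ones from the snippet above.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical object name, used only for this illustration.
object CrossNamenodeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Example").getOrCreate()

    // A fully qualified hdfs:// URI names the namenode for that path,
    // so the read does not depend on the cluster's default filesystem.
    val input = spark.read.parquet("hdfs://defaultnamenode:9000/sample.parquet")

    // The default save mode is ErrorIfExists, which fails when the
    // target directory is already present. Overwrite replaces its
    // contents (assumption: they are disposable); SaveMode.Append
    // would keep them and add new files instead.
    input.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs://secondnamenode:9000/sample")

    spark.stop()
  }
}
```

If the existing data under the target path must be kept, `SaveMode.Append` or a fresh output directory would be the safer choice.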

What do you think is the purpose of a namenode in a HDFS cluster? You've got two namenodes in a single cluster, haven't you? — Jacek Laskowski, Dec 18 '17 at 10:02
@philantrovert: I didn't set permissions carefully on the old namenode, and changing them now would affect many projects that rely on it. That's why I want to add the new one. — Quy Doan, Dec 19 '17 at 07:41
