We are trying to use Talend batch (Spark) jobs to access Hive in a Kerberized cluster, but we are getting the "Can't get Master Kerberos principal for use as renewer" error below.

Using the standard (non-Spark) jobs in Talend, we are able to access Hive without any issue.

Below are our observations:

  1. When running Spark jobs, Talend is able to connect to the Hive metastore and validate the syntax, e.g. if I provide a wrong table name it returns "table not found".
  2. When we run select count(*) on a table that has no data, it returns "NULL", but if some data is present in HDFS for the table, it fails with the error "Can't get Master Kerberos principal for use as renewer".

I am not sure exactly what issue is causing the token problem. Could someone help us find the root cause?

One more thing to add: if I read/write to HDFS (instead of Hive) using Spark batch jobs, it works. So the problem is only with Hive and Kerberos.

William R
  • Step 1: just Google that error message, i.e. `Can't get Master Kerberos principal for use as renewer` >> Step 2: browse a few answers, involving both the Cloudera and HortonWorks distros *(to get some perspective, i.e. in Cloudera jargon, "gateway node" simply means "has Hadoop client libs + Hadoop config files")* >> Step 3: understand that you are probably missing some **critical config files** such as `core-site.xml`, `hdfs-site.xml`, **`mapred-site.xml`**, **`yarn-site.xml`** and `hive-site.xml` -- Hive spawns MapReduce jobs, remember...! – Samson Scharfrichter May 03 '17 at 21:17
  • @Samson Scharfrichter I am really missing something, but I'm not sure what exactly. I passed the entire hive-site.xml and mapred-site.xml to the Hive connection, and yarn-site.xml, core-site.xml and hdfs-site.xml to the HDFS connection, but it still fails with the same error. One more thing to note: in the log, before the Spark job starts, I get "HADOOP_HOME or hadoop.home.dir are not set" as a warning. – William R May 05 '17 at 08:05
  • _"I passed entire xxx-site.xml"_ -- passed to *what*? Passed *how*? The Hadoop libraries don't care about Spark config, nor about Talend config. They rely on their CLASSPATH at run time, and that run time sometimes happens *before* the Spark driver has started. That's why the `spark-submit` shell needs some env variables such as `HADOOP_CONF_DIR`. With Talend, I guess you need to add the **directory** where the conf files are present directly to the job CLASSPATH. – Samson Scharfrichter May 05 '17 at 09:04
  • ... plus add that directory (or directories) also to `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, since the conf may also be needed inside the driver and/or the executors. And it has to be a local directory, i.e. in YARN-client or YARN-cluster mode you need to point to `/etc/hadoop/conf:/etc/hive/conf` – Samson Scharfrichter May 05 '17 at 09:07
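
To make the last two comments concrete, here is a minimal sketch of a `spark-submit` invocation that exposes the Hadoop/Hive configuration directories to both the driver and the executors. The paths `/etc/hadoop/conf` and `/etc/hive/conf` and the job file name are assumptions; adapt them to your cluster layout:

    # assumed location of core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml
    export HADOOP_CONF_DIR=/etc/hadoop/conf

    spark-submit \
      --master yarn \
      --conf spark.driver.extraClassPath=/etc/hadoop/conf:/etc/hive/conf \
      --conf spark.executor.extraClassPath=/etc/hadoop/conf:/etc/hive/conf \
      my_job.py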

2 Answers

You should include the Hadoop configuration directory in the classpath (e.g. append `:/path/hadoop-configuration`), and that directory should contain all the configuration files, not only core-site.xml and hdfs-site.xml. The same thing happened to me, and that solved the problem.
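
For example, with a Spark job the equivalent settings could go into spark-defaults.conf (a sketch; /path/hadoop-configuration stands for whatever local directory holds your *-site.xml files):

    spark.driver.extraClassPath    /path/hadoop-configuration
    spark.executor.extraClassPath  /path/hadoop-configuration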

sgalinma

I hit the same problem when starting Spark on Kubernetes (k8s):

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Can't get Master Kerberos principal for use as renewer
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:133)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
        at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:243)
        at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
        at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:273)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:269)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:269)

I fixed it by adding a yarn-site.xml to the HADOOP_CONF_DIR.

The yarn-site.xml only contains yarn.resourcemanager.principal:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.principal</name>
    <value>yarn/_HOST@DM.COM</value>
  </property>
</configuration>

This works for me.
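
For reference, a sketch of how the directory containing that yarn-site.xml could be handed to a Kubernetes submission; the config path, API server address, and image name below are assumptions for illustration:

    # directory holding the yarn-site.xml shown above
    export HADOOP_CONF_DIR=/opt/spark/conf/hadoop

    spark-submit \
      --master k8s://https://<api-server>:6443 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      my_job.py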

geosmart