I want to use Apache Flink on a secure (kerberized) HDP 3.1 cluster, but am still stuck at the first steps.
The latest release (Flink 1.10.1) was downloaded and unzipped (https://flink.apache.org/downloads.html#apache-flink-1101).
Now I am trying to follow https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/, which states:
To integrate with Hive, you need to add some extra dependencies to the /lib/ directory in Flink distribution to make the integration work in Table API program or SQL in SQL Client. Alternatively, you can put these dependencies in a dedicated folder, and add them to classpath with the -C or -l option
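For reference, my reading of the two options from that quote, with placeholder paths (not verified on HDP):
# option 1: drop the dependencies into Flink's lib/ directory
cp /path/to/hive-deps/*.jar /path/to/flink-1.10.1/lib/
# option 2: keep them in a dedicated folder and pass it to the SQL Client via -l
./bin/sql-client.sh embedded -l /path/to/hive-deps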
Due to the HDP environment:
- the existing configuration resides in:
/usr/hdp/current/hive-server2/conf/hive-site.xml
- the JARs are in
/usr/hdp/current/hive-server2/lib
Potentially, the Flink-provided JARs could be used, but I would prefer to use the Hive JARs directly from the HDP distribution: https://github.com/apache/flink/pull/11328/files/b4a76d76d2c1e9722befabc03b2191d053c70fa8#diff-ecb34d0bf175b780ec6ca71da8ec23beR111
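What I have in mind is roughly the following (a hypothetical sketch; the exact list of required JARs is precisely what I am unsure about, hive-exec is only a guess on my part):
# collect the HDP-provided Hive JARs in a dedicated folder instead of using the flink-provided bundles
mkdir -p /path/to/flink-hive-deps
ln -s /usr/hdp/current/hive-client/lib/hive-exec-*.jar /path/to/flink-hive-deps/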
How can I:
- tell Flink to load these JARs onto the classpath?
- omit the first configuration steps (name, default database, conf dir, version), or somehow infer them automatically from hive-site.xml?
- start an interactive shell (similar to a spark-shell, i.e. like Flink's interactive SQL shell but Scala-based) in order to follow along with the steps outlined in the link?
The code:
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
import org.apache.flink.table.catalog.hive.HiveCatalog

val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)
val name = "myhive"
val defaultDatabase = "mydatabase"
val hiveConfDir = "/opt/hive-conf" // a local path
val version = "2.3.4"
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
tableEnv.registerCatalog("myhive", hive)
// set the HiveCatalog as the current catalog of the session
tableEnv.useCatalog("myhive")
edit
Presumably, the following is needed so that Flink finds the Hadoop configuration:
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
As well as, per https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/hadoop.html:
export HADOOP_CLASSPATH=$(hadoop classpath)
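To double-check that these exports actually point at the HDP configuration and JARs (a plain sanity check, nothing Flink-specific):
# the Hadoop client configuration files (core-site.xml, hdfs-site.xml, ...) should show up here
ls "$HADOOP_CONF_DIR"/*-site.xml
# and the resolved classpath should reference the /usr/hdp/... JARs
hadoop classpath | tr ':' '\n' | head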
For now, I still fail to start a Flink Scala shell, even without any Hive support:
cd /path/to/flink-1.10.1/bin
./start-scala-shell.sh
Error: Could not find or load main class org.apache.flink.api.scala.FlinkShell
This preliminary problem seems to be fixable by switching to the older Scala 2.11 build, see: Flink 1.7.2 start-scala-shell.sh cannot find or load main class org.apache.flink.api.scala.FlinkShell.
./start-scala-shell.sh local
already works for me to start a local shell.
./start-scala-shell.sh yarn
starts something (locally), but no YARN container is launched.
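To verify that no YARN application was started, I assume the yarn CLI is an appropriate check:
# should list a Flink session application if one was really submitted to YARN
yarn application -list | grep -i flink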
Meanwhile I have set:
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /usr/hdp/current/hive-server2/conf
    hive-version: 3.1.2
in the local Flink configuration. It is still unclear to me whether simply specifying the environment variables mentioned above should make this work automatically.
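To sanity-check whether the catalog entry is picked up at all, I would expect something like the following to work (assuming the entry belongs into conf/sql-client-defaults.yaml, which is my reading of the SQL Client docs):
cd /path/to/flink-1.10.1
./bin/sql-client.sh embedded
# inside the client, SHOW CATALOGS; should then list myhive next to default_catalog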
However, for me the code does not compile, as env is not defined:
scala> env
<console>:68: error: not found: value env
But trying to manually create the environment:
import org.apache.flink.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
// environment configuration
val env = ExecutionEnvironment.getExecutionEnvironment
val tEnv = BatchTableEnvironment.create(env)
fails as well, with:
java.lang.UnsupportedOperationException: Execution Environment is already defined for this shell.
(presumably because the Scala shell already pre-binds its own execution environments, e.g. benv / btenv)
edit 2
With:
cd /path/to/flink-1.10.1
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export HADOOP_CLASSPATH=$(hadoop classpath)
./bin/yarn-session.sh --queue my_queue -n 1 -jm 768 -tm 1024
I can successfully start a minimalistic Flink cluster on YARN (without the Ambari service), though it would make sense to install the Ambari integration.
For now, I could not yet test whether and how the interaction with the kerberized Hive and HDFS works. Also, I still fail to start an interactive shell, as outlined below.
In fact, even in a non-kerberized playground environment I observe issues with Flink's interactive shell: flink start scala shell - numberformat exception.
edit 3
I do not know what changed, but with:
cd /home/at/heilerg/development/software/flink-1.10.1
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export HADOOP_CLASSPATH=$(hadoop classpath)
./bin/start-scala-shell.sh local
btenv.listDatabases
//res12: Array[String] = Array(default_database)
btenv.listTables
//res9: Array[String] = Array()
I can get hold of a batch table environment (btenv) in local mode. Currently, no tables or databases from Hive are visible, though.
NOTE: the configuration is set up as follows:
catalogs:
  - name: hdphive
    type: hive
    hive-conf-dir: /usr/hdp/current/hive-server2/conf
    hive-version: 3.1.2
When instead trying to use code over configuration, I cannot import the HiveCatalog:
val name = "hdphive"
val defaultDatabase = "default"
val hiveConfDir = "/usr/hdp/current/hive-server2/conf" // a local path
val version = "3.1.2" //"2.3.4"
import org.apache.flink.table.catalog.hive.HiveCatalog
// for now, I am failing here
// <console>:67: error: object hive is not a member of package org.apache.flink.table.catalog
// import org.apache.flink.table.catalog.hive.HiveCatalog
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
btenv.registerCatalog(name, hive)
btenv.useCatalog(name)
btenv.listDatabases
These JARs were manually put into the lib directory:
- https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber/3.1.1.7.0.3.0-79-7.0
- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-hive_2.11/1.10.1
Regardless of the version of Hive's JARs, I face missing Hive classes:
val version = "3.1.2" // or "3.1.0" // has the same problem
import org.apache.flink.table.catalog.hive.HiveCatalog
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/NoSuchObjectException
... 30 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.metastore.api.NoSuchObjectException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 30 more
But isn't export HADOOP_CLASSPATH=$(hadoop classpath) supposed to load the HDP classes?
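A quick way to test my assumption that hadoop classpath only covers the Hadoop JARs, but not the Hive ones:
# if this prints nothing, the Hive metastore classes are indeed not on the classpath
hadoop classpath | tr ':' '\n' | grep -i hive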
Anyway:
cp /usr/hdp/current/hive-client/lib/hive-exec-3.1.0.<<<version>>>.jar /path/to/flink-1.10.1/lib
gets me one step further:
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
btenv.registerCatalog(name, hive)
Caused by: java.lang.ClassNotFoundException: com.facebook.fb303.FacebookService$Iface
After adding libfb303 (https://repo1.maven.org/maven2/org/apache/thrift/libfb303/0.9.3/) to the lib directory,
btenv.registerCatalog(name, hive)
no longer complains with a ClassNotFoundException, but execution seems to be stuck at this step for several minutes. Then, it fails with a Kerberos exception:
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: GSS initiate failed
I just realized that:
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/_HOST@xxxx</value>
</property>
and
klist
Default principal: user@xxxx
So the principal from klist does not match the one from hive-site.xml. However, Spark can read the metastore just fine with the same configuration and the same principal mismatch just outlined here.