
I have a standalone Flink installation on top of which I want to run a streaming job that writes data into an HDFS installation. The HDFS installation is part of a Cloudera deployment and requires Kerberos authentication in order to read from and write to HDFS. Since I found no documentation on how to make Flink connect to a Kerberos-protected HDFS, I had to make some educated guesses about the procedure. Here is what I did so far:

  • I created a keytab file for my user.
  • In my Flink job, I added the following code:

    UserGroupInformation.loginUserFromKeytab("myusername", "/path/to/keytab");
    
  • Finally, I am using a TextOutputFormat to write data to HDFS.
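
Condensed, the relevant part of the job looks roughly like this (a sketch; the output path and the sample elements are placeholders):

    import org.apache.flink.api.java.io.TextOutputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureHdfsJob {
        public static void main(String[] args) throws Exception {
            // Log in against Kerberos from inside the job code.
            UserGroupInformation.loginUserFromKeytab("myusername", "/path/to/keytab");

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Write the stream into the Kerberos-protected HDFS.
            env.fromElements("a", "b", "c")
               .writeUsingOutputFormat(new TextOutputFormat<String>(new Path("hdfs://namenode:8020/some/output")));

            env.execute("write to secured HDFS");
        }
    }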

When I run the job, I'm getting the following error:

org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
        at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
        at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1730)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1668)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1593)
        at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:397)
        at org.apache.hadoop.hdfs.DistributedFileSystem$6.doCall(DistributedFileSystem.java:393)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:393)
        at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:337)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:405)

For some odd reason, Flink seems to try SIMPLE authentication, even though I called loginUserFromKeytab. I found a similar question on Stack Overflow (Error with Kerberos authentication when executing Flink example code on YARN cluster (Cloudera)) with an answer explaining that:

Standalone Flink currently only supports accessing Kerberos secured HDFS if the user is authenticated on all worker nodes.

That probably means I have to do some authentication at the OS level, e.g. with kinit. Since my knowledge of Kerberos is very limited, I have no idea how I would do that. I would also like to understand how a program running after kinit actually knows which Kerberos ticket to pick from the local cache when there is no configuration whatsoever regarding this.

Jan Thomä

3 Answers


I'm not a Flink user, but based on what I've seen with Spark & friends, my guess is that "Authenticated on all worker nodes" means that each worker process has

  1. a core-site.xml config available on local fs with hadoop.security.authentication set to kerberos (among other things)

  2. the local dir containing core-site.xml added to the CLASSPATH so that it is found automatically by the Hadoop Configuration object [it will revert silently to default hard-coded values otherwise, duh]

  3. implicit authentication via kinit and the default cache [TGT set globally for the Linux account, impacts all processes, duh] ## or ## implicit authentication via kinit and a "private" cache set thru KRB5CCNAME env variable (Hadoop supports only "FILE:" type) ## or ## explicit authentication via UserGroupInformation.loginUserFromKeytab() and a keytab available on the local fs

That UGI "login" method is incredibly verbose, so if it was indeed called before Flink tries to initiate the HDFS client from the Configuration, you will notice. On the other hand, if you don't see the verbose stuff, then your attempt to create a private Kerberos TGT is bypassed by Flink, and you have to find a way to bypass Flink :-/

Samson Scharfrichter
  • Ah, for a long-running Streaming job, there's also the issue of renewing the Kerberos TGT -- see http://stackoverflow.com/questions/33211134/hbase-kerberos-connection-renewal-strategy/33243360#33243360 – Samson Scharfrichter Jan 04 '16 at 18:44
  • You can achieve point 3 (last "or") via flink-conf.yaml which supports Kerberos options, which is convenient to decouple code from env settings if you have multiple deployments: https://ci.apache.org/projects/flink/flink-docs-stable/ops/security-kerberos.html – Alessandro S. Jan 09 '20 at 08:47

You can also configure your standalone cluster to handle authentication for you, without additional code in your jobs:

  1. Export HADOOP_CONF_DIR and point it to the directory where core-site.xml and hdfs-site.xml are located.
  2. Add the following to flink-conf.yaml:
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: <path to keytab>
security.kerberos.login.principal: <principal>
env.java.opts: -Djava.security.krb5.conf=<path to krb5 conf>
  3. Add the pre-bundled Hadoop jar to the lib directory of your cluster: https://flink.apache.org/downloads.html

The only dependencies you should need in your jobs are:

compile "org.apache.flink:flink-java:$flinkVersion"
compile "org.apache.flink:flink-clients_2.11:$flinkVersion"
compile 'org.apache.hadoop:hadoop-hdfs:$hadoopVersion'
compile 'org.apache.hadoop:hadoop-client:$hadoopVersion'
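
With that cluster-side setup, the job itself needs no Kerberos-specific code at all; a minimal sketch (the HDFS path is a placeholder):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HdfsWriteJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // No UserGroupInformation calls: the JobManager and TaskManagers perform
            // the Kerberos login themselves based on flink-conf.yaml.
            env.fromElements("a", "b", "c")
               .writeAsText("hdfs://namenode:8020/tmp/flink-out");

            env.execute("write to secured HDFS");
        }
    }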
literg

In order to access a secured HDFS or HBase installation from a standalone Flink installation, you have to do the following:

  • Log into the server running the JobManager, authenticate against Kerberos using kinit and start the JobManager (without logging out or switching the user in between).
  • Log into each server running a TaskManager, authenticate again using kinit and start the TaskManager (again, with the same user).
  • Log into the server from where you want to start your streaming job (often, it's the same machine running the JobManager), log into Kerberos (with kinit) and start your job with /bin/flink run.

In my understanding, kinit logs in the current user and creates a file somewhere in /tmp with some login data. The mostly static class UserGroupInformation looks up that file with the login data when it is loaded for the first time. If the current user is authenticated with Kerberos, the information is used to authenticate against HDFS.
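
One way to see what UserGroupInformation actually picked up after a kinit is a small check like this (a sketch; it assumes a core-site.xml with hadoop.security.authentication set to kerberos is on the classpath):

    import java.io.IOException;
    import org.apache.hadoop.security.UserGroupInformation;

    public class WhoAmI {
        public static void main(String[] args) throws IOException {
            // Reads the Kerberos ticket cache (by default /tmp/krb5cc_<uid>, or
            // whatever KRB5CCNAME points to) the first time UGI is used.
            UserGroupInformation current = UserGroupInformation.getCurrentUser();
            System.out.println("User: " + current.getUserName());
            System.out.println("Auth method: " + current.getAuthenticationMethod());
            System.out.println("Has Kerberos credentials: " + current.hasKerberosCredentials());
        }
    }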

Robert Metzger
    The default cache file is indeed `/tmp/krb5cc_` (as you can see with `klist`), but actually, you can use a *private* ticket cache for your app -- i.e. you can run multiple apps on the same node with the same Linux account but different Kerberos principals -- by setting `KRB5CCNAME` environment variable, as long as it's a "FILE:" type. – Samson Scharfrichter Jan 04 '16 at 18:47