Spark in kerberized Hadoop environment and High Availability enabled: Spark SQL can only read data after write task

Question

We were using a kerberized Hadoop environment (HDP 3.1.4 with Spark 2.3.2 and Ambari 2.7.4) for a long time, everything went well so far.

Now we enabled NameNode high availability and have the following issue: When we want to read data using Spark SQL, we first have to write some (other) data. If we don't write something before the read operation, it fails.

Here our scenario:

$ kinit -kt /etc/security/keytabs/user.keytab user
$ spark-shell

Run a Read request -> This first read request per session fails!

scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
Hive Session ID = cbb6b6e2-a048-41e0-8e77-c2b2a7f52dbe
[Stage 0:>                                                          (0 + 1) / 1]20/04/22 15:04:53 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, had-data6.my-company.de, executor 2): java.io.IOException: DestHost:destPort had-job.my-company.de:8020 , LocalHost:localPort had-data6.my-company.de/192.168.178.123:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:806)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1502)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        at org.apache.hadoop.ipc.Client.call(Client.java:1354)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy13.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:862)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:851)
        at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:840)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1004)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320)
        at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:328)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:522)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:364)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:251)
        [...]

Run a Write job -> This works!

scala> val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> primitiveDS.write.saveAsTable("pm.todelete3")
20/04/22 15:05:07 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

Now, do this the same read again -> It works (for same session)!?

scala> spark.sql("SELECT * FROM pm.simulation_uci_hydraulic_sensor").show
+--------+--------+--------------------+------+
|instance|sensorId|                  ts| value|
+--------+--------+--------------------+------+
|      21|     PS6|2020-04-18 17:19:...| 8.799|
|      21|    EPS1|2020-04-18 17:19:...|2515.6|
|      21|     PS3|2020-04-18 17:19:...| 2.187|
+--------+--------+--------------------+------+

When running a new spark-shell session, same behavior!

Can someone help with this issue? Thank you!

score 1 · Answer 1 · answered Apr 22 '20 at 15:39

We found the answer for the problem: The table properties pointed to the "old" NameNode location in the table that was created before activating High Availability in the Hadoop cluster.

You can find table information by running the following command:

$ spark-shell
scala> spark.sql("DESCRIBE EXTENDED db.table").show(false)

This shows Table information like in my case:

+----------------------------+---------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                    |comment|
+----------------------------+---------------------------------------------------------------------------------------------+-------+
|instance                    |int                                                                                          |null   |
|sensorId                    |string                                                                                       |null   |
|ts                          |timestamp                                                                                    |null   |
|value                       |double                                                                                       |null   |
|                            |                                                                                             |       |
|# Detailed Table Information|                                                                                             |       |
|Database                    |simulation                                                                                   |       |
|Table                       |uci_hydraulic_sensor_1                                                                       |       |                                                                                                                          |       |
|Created By                  |Spark 2.3.2.3.1.4.0-315                                                                      |       |
|Type                        |EXTERNAL                                                                                     |       |
|Provider                    |parquet                                                                                      |       |
|Statistics                  |244762020 bytes                                                                              |       |
|Location                    |hdfs://had-job.mycompany.de:8020/projects/pm/simulation/uci_hydraulic_sensor_1       <== This is important!
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe                                  |       |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat                                |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat                               |       |
+----------------------------+---------------------------------------------------------------------------------------------+-------+

To set the new table location with the HA cluster service name run following SQL:

$ spark-shell
scala> spark.sql("ALTER TABLE simulation.uci_hydraulic_sensor_1 SET LOCATION 'hdfs://my-ha-name/projects/pm/simulation/uci_hydraulic_sensor_1'")

In further Spark sessions the table read works fine!

Spark in kerberized Hadoop environment and High Availability enabled: Spark SQL can only read data after write task

1 Answers1