
I am trying to get a Spark cluster to write to SQL Server using JavaKerberos with Microsoft's JDBC driver (v7.0.0), i.e. I specify integratedSecurity=true;authenticationScheme=JavaKerberos in the connection string, with credentials supplied in a keytab file, and I am not having much success (the problem is the same if I specify credentials in the connection string).

I am submitting the job to the cluster (4 nodes, YARN mode, Spark v2.3.0) with:

spark-submit --driver-class-path mssql-jdbc-7.0.0.jre8.jar \
--jars /path/to/mssql-jdbc-7.0.0.jre8.jar \
--conf spark.executor.extraClassPath=/path/to/mssql-jdbc-7.0.0.jre8.jar \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf" \
application.jar
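
For reference, SQLJDBCDriver.conf is a plain JAAS login configuration. A minimal sketch of what such a file contains (the entry name SQLJDBCDriver is the one the MS driver looks up by default; the keytab path and principal below are placeholders):

SQLJDBCDriver {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/user.keytab"
  principal="user@EXAMPLE.COM"
  doNotPrompt=true
  storeKey=true
  debug=true;
};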

Things work partially: the Spark driver authenticates correctly and creates the table, but when any of the executors comes to write to the table it fails with an exception:

java.security.PrivilegedActionException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
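
For context, the failing write is an ordinary DataFrame JDBC write, so every executor opens its own connection to SQL Server for the partitions it writes. Roughly (a sketch; df, host, database and table names are placeholders):

val url = "jdbc:sqlserver://sqlhost:1433;databaseName=mydb;" +
  "integratedSecurity=true;authenticationScheme=JavaKerberos"

val props = new java.util.Properties()
props.put("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// each executor establishes its own JDBC connection when writing its partitions
df.write.mode("append").jdbc(url, "dbo.my_table", props)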

Observations:

  • I can get everything to work if I specify SQL server credentials (however I need to use integrated security in my application)
  • The keytab and login module file “SQLJDBCDriver.conf” seem to be specified correctly since they work for the driver
  • I can see in the Spark UI that the executors pick up the correct command-line option: -Djava.security.auth.login.config=/path/to/SQLJDBCDriver.conf

After a lot of logging/debugging of the difference between driver and executor behaviour, it seems to come down to the executors trying to use the wrong credentials, even though the options specified should make them use the ones in the keytab file, as the driver successfully does. (That would explain this particular exception: it is the same one I get if I deliberately supply incorrect credentials.)

Strangely, I can see in the debug output that the JDBC driver finds and reads the SQLJDBCDriver.conf file, and the keytab has to be present (otherwise I get a file-not-found failure), yet it then promptly ignores them and falls back to the default behaviour/local user credentials.

Can anyone help me understand how I can force the executors to use credentials provided in a keytab or otherwise get JavaKerberos/SQL Server authentication to work with Spark?

quarkonium
4 Answers


Just to give an update on this, I've just closed https://issues.apache.org/jira/browse/SPARK-12312 and now it's possible to do Kerberos authentication with the built-in JDBC connection providers. Many providers have been added, and one of them is MS SQL. Please read the documentation on how to use it: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Please be aware that Spark 3.1 is not yet released, so it will take some time until the two newly added configuration parameters (keytab and principal) appear on that page. I think the doc update will happen within 1-2 weeks.
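
Once you are on a build that contains the change, usage should look roughly like this (a sketch; the URL, table, principal and keytab path are placeholders):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=mydb;integratedSecurity=true;authenticationScheme=JavaKerberos")
  .option("dbtable", "dbo.my_table")
  .option("principal", "user@EXAMPLE.COM")
  .option("keytab", "/path/to/user.keytab")  // must be reachable at this path on the driver and all executors
  .load()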

Gabor Somogyi

Integrated authentication does not work with the MS SQL Server JDBC driver in a secure cluster with AD integration, because the containers do not have the Kerberos context: the Kerberos tokens are lost when the mappers spawn (YARN transitions the job to its internal security subsystem).

Here is my repo that I used as a workaround to get Kerberos/AD authentication: https://github.com/chandanbalu/mssql-jdbc-krb5. The solution implements a Driver that overrides the connect method of the latest MS SQL JDBC driver (mssql-jdbc-9.2.1.jre8.jar), obtains a ticket for the given keytab file/principal, and hands this connection back.

You can grab the latest build of this custom driver from the repository's release folder.
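
The core idea of the wrapper is a java.sql.Driver that performs a JAAS/Kerberos login from the keytab and then opens the real MS JDBC connection as that Subject. This is only a sketch of the approach, not the actual repo code (the real project additionally handles the jdbc:krb5ss URL prefix and the krb5Principal/krb5Keytab URL options):

import java.security.PrivilegedExceptionAction
import java.sql.{Connection, Driver, DriverPropertyInfo}
import java.util.Properties
import java.util.logging.Logger

import javax.security.auth.Subject
import javax.security.auth.login.LoginContext

class Krb5DelegatingDriver extends Driver {
  // the real driver that does the actual work
  private val delegate = new com.microsoft.sqlserver.jdbc.SQLServerDriver()

  override def connect(url: String, info: Properties): Connection = {
    // "SQLJDBCDriver" is the JAAS entry that names the keytab and principal
    val lc = new LoginContext("SQLJDBCDriver")
    lc.login()
    // open the real connection with the freshly obtained Kerberos credentials
    Subject.doAs(lc.getSubject, new PrivilegedExceptionAction[Connection] {
      override def run(): Connection = delegate.connect(url, info)
    })
  }

  override def acceptsURL(url: String): Boolean = delegate.acceptsURL(url)
  override def getPropertyInfo(url: String, info: Properties): Array[DriverPropertyInfo] =
    delegate.getPropertyInfo(url, info)
  override def getMajorVersion(): Int = delegate.getMajorVersion
  override def getMinorVersion(): Int = delegate.getMinorVersion
  override def jdbcCompliant(): Boolean = delegate.jdbcCompliant()
  override def getParentLogger(): Logger = delegate.getParentLogger
}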

Start spark-shell with JARS

spark-shell --jars /efs/home/c795701/.ivy2/jars/mssql-jdbc-9.2.1.jre8.jar,/efs/home/c795701/mssql-jdbc-krb5/target/scala-2.10/mssql-jdbc-krb5_2.10-1.0.jar

Scala

scala> val jdbcDF = spark.read.format("jdbc")
         .option("url", "jdbc:krb5ss://<SERVER_NAME>:1433;databasename=<DATABASE_NAME>;integratedSecurity=true;authenticationScheme=JavaKerberos;krb5Principal=c795701@NA.DOMAIN.COM;krb5Keytab=/efs/home/c795701/c795701.keytab")
         .option("driver", "hadoop.sqlserver.jdbc.krb5.SQLServerDriver")
         .option("dbtable", "dbo.table_name")
         .load()

scala> jdbcDF.count()
scala> jdbcDF.show(10)
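
Writes go through the same overridden connect on every executor, so a write looks the same apart from the direction (a sketch; dbo.target_table is a placeholder and the driver class name is taken from the read example above):

scala> jdbcDF.write
         .format("jdbc")
         .option("url", "jdbc:krb5ss://<SERVER_NAME>:1433;databasename=<DATABASE_NAME>;integratedSecurity=true;authenticationScheme=JavaKerberos;krb5Principal=c795701@NA.DOMAIN.COM;krb5Keytab=/efs/home/c795701/c795701.keytab")
         .option("driver", "hadoop.sqlserver.jdbc.krb5.SQLServerDriver")
         .option("dbtable", "dbo.target_table")
         .mode("append")
         .save()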

spark-submit command

com.spark.SparkJDBCIngestion - Spark JDBC data frame operations

ingestionframework-1.0-SNAPSHOT.jar - Your project build JAR

spark-submit \
--master yarn \
--deploy-mode cluster \
--jars "/efs/home/c795701/mssql-jdbc-krb5/target/scala-2.10/mssql-jdbc-krb5_2.10-1.0.jar,/efs/home/c795701/.ivy2/jars/scala-library-2.11.1.jar"
--files /efs/home/c795701/c795701.keytab
--class com.spark.SparkJDBCIngestion \
/efs/home/c795701/ingestionframework/target/ingestionframework-1.0-SNAPSHOT.jar

So apparently JDBC Kerberos authentication on the executors is just not possible currently, according to an old JIRA: https://issues.apache.org/jira/browse/SPARK-12312. The behaviour is unchanged as of version 2.3.2, according to the Spark user list and my own testing.

Workarounds

  1. Use kinit and then distribute the cached TGT to the executors, as detailed here: https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Executors_Kerberos_HowTo.md. I think this technique only works for the user that the Spark executors run under; at least I couldn't get it to work for my use case.
  2. Wrap the JDBC driver in a custom version that handles the authentication itself and then obtains and returns a connection from the real MS JDBC driver. Details here: https://datamountaineer.com/2016/01/15/spark-jdbc-sql-server-kerberos/ and the associated repo here: https://github.com/nabacg/krb5sqljdb. I got this technique to work, though I had to modify the authentication code for my case.
quarkonium

As Gabor Somogyi said, you need to use Spark > 3.1.0 with the keytab and principal arguments; I have 3.1.1.

  1. Put the keytab at the same path on ALL hosts/machines where your code runs, and keep the keytab up to date
  2. Add integratedSecurity=true;authenticationScheme=JavaKerberos; to the connection string
  3. The reading block will look like:
jdbcDF = (spark.read
        .format("com.microsoft.sqlserver.jdbc.spark")
        .option("url", url)
        .option("dbtable", table_name)
        .option("principal", "username@domen")
        .option("keytab", "sameALLhostKEYTABpath")
        .load()
)