Good morning / afternoon / evening!

Spark 2.4.x, with Hive 1.2.1

Source code here: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/v1.2/src/main/java/org/apache/hive/service/auth/KerberosSaslHelper.java

  public static TTransport getKerberosTransport(String principal, String host,
    TTransport underlyingTransport, Map<String, String> saslProps, boolean assumeSubject)
    throws SaslException {
    try {
      String[] names = principal.split("[/@]");
      if (names.length != 3) {
        throw new IllegalArgumentException("Kerberos principal should have 3 parts: " + principal);
      }
      // ... (rest of the method omitted)

Now the question:

Does anyone know why the Spark Thrift Server needs a three-part Kerberos principal?

The Spark Thrift Server works by submitting a long-running job, which does not need a three-part Kerberos principal.

Starting a service that listens on a port does not need a three-part Kerberos principal either (just like the Spark job history server), right?

So I wonder why this code checks whether the principal has three parts.

Thanks!

kcode2019
  • A Kerberos **client** just needs a UPN (User Principal Name) e.g. `john-doe@ACME.CORP` but a Kerberos **service** needs a SPN (Service Principal Name) for a specific service type and a specific server name e.g. `hive/some.server.at.xyz.acme.corp@ACME.CORP` -- the server name is checked by forward then reverse DNS lookups to avoid various types of network attacks _(e.g. man-in-the-middle)_ just as the ticket timestamp is checked to avoid replay attacks etc. etc. – Samson Scharfrichter Mar 13 '20 at 22:12
  • Note that mentioning the realm is important for multinational companies, e.g. a UPN `john.doe@CA.ACME.CORP` that needs to connect to a service `@CN.ACME.CORP` first authenticates against the KDC for `CA.ACME.CORP` then hops to the parent KDC for `ACME.CORP` then hops to the target KDC for `CN.ACME.CORP` to obtain its service ticket - which is finally presented to the Hive service... – Samson Scharfrichter Mar 13 '20 at 22:17
  • BTW which version of the Spark History Server are you referring to - Spark 2.x with no Kerberos authentication (just needs a UPN to authenticate against HDFS) or Spark 3.x with Kerberos auth (needs a UPN as `spark` **and** a SPN as `HTTP/server.name`)? – Samson Scharfrichter Mar 13 '20 at 22:30
  • _"Spark thrift server works by submitting a long running job"_ >> this is not the point. Spark Thrift Server emulates HiveServer2, so it has to use the same transport protocols _(either Thrift binary or Thrift-over-HTTP)_ and the same authentication options. For Kerberos authentication, that implies a SPN (hence the 3 parts) which is normally `hive/server.name@REAL.M` -- plus `HTTP/server.name@REAL.M` when using the HTTP wrapper around the Thrift payload but that one is implicit – Samson Scharfrichter Mar 13 '20 at 22:44
  • Hmm, then why doesn't the Spark job history server require a three-part Kerberos principal? (I am talking about Spark 2.4.x with Hive 1.2.x.) Thanks @SamsonScharfrichter – kcode2019 Mar 13 '20 at 23:35
  • Repeating: Spark History Server ... 2.x with **no Kerberos authentication**. When you connect to SHS, you don't present a Kerberos ticket via SPNego protocol. Connection is anonymous. No access control. Nada. Open bar. >> But since SHS needs to access HDFS, as a _client_, it needs a UPN to authenticate against the NameNode and DataNodes (which use a SPN because they are kerberized services) – Samson Scharfrichter Mar 14 '20 at 10:14
  • Repeating louder: a process acting as _client_ just needs a UPN (can use a SPN if it already has one but that's just an option). A process acting as _service_ needs a SPN. – Samson Scharfrichter Mar 14 '20 at 10:17

1 Answer

The three parts of a Kerberos principal are {service}/{name}[@{realm}], e.g. host/server.domain.com@realm.com or, with the realm left off, host/server.domain.com. The client relies on this information to determine where the Kerberos KDC is so it can request a ticket. The realm portion is often optional if the name ends in the realm name (e.g. server.realm.com). In this case the client is just being explicit about it.
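
To make the three parts concrete, here is a minimal sketch that applies the same `split("[/@]")` regex as the snippet quoted in the question to a few sample principals. The class name `PrincipalParts` and the helper `hasThreeParts` are made up for illustration and are not part of Spark or Hive; the sample principals are the ones mentioned on this page.

```java
import java.util.Arrays;
import java.util.List;

public class PrincipalParts {
    // Same regex as the split in the quoted getKerberosTransport snippet:
    // break the principal on '/' and '@'.
    static String[] split(String principal) {
        return principal.split("[/@]");
    }

    // Hypothetical helper that mirrors the quoted "names.length != 3" check.
    static boolean hasThreeParts(String principal) {
        return split(principal).length == 3;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
            "hive/some.server.at.xyz.acme.corp@ACME.CORP", // service/host@REALM
            "john-doe@ACME.CORP",                          // user@REALM, no host part
            "host/server.domain.com");                     // service/host, realm left off
        for (String p : samples) {
            System.out.println(p + " -> " + Arrays.toString(split(p))
                + " threeParts=" + hasThreeParts(p));
        }
    }
}
```

Only the first sample yields three parts; the user-style principal and the one without a realm both split into two, so they would be rejected by the quoted check.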

I have no idea why the author chose to require all three. It does make the client implementation easier because you don't have to guess at the intent, but it's at the expense of the simplicity of the API.

Steve
  • _"The realm portion is often optional if the name ends in the realm name"_ >> ugh, there is no obvious relationship between the network domain and the Kerberos realm, except in Microsoft Active Directory (which is kind of a monstrous fork of standard Kerberos IMHO). Most large companies that I have worked with (the kind with 10+ data centers worldwide) don't use AD for DNS, and use `krb5.conf` to map network domains (and sometimes individual servers) to Kerberos realms. With one exception -- and they got the NotPetya malware so I guess they ditched as much of AD as they could since then. – Samson Scharfrichter Mar 13 '20 at 22:24