I am trying to submit a Dataproc job that will consume data from a Kerberized Kafka cluster. The current working solution is to have the JAAS config file and keytab on the machine that runs the dataproc jobs submit command:

gcloud dataproc jobs submit pyspark \
    --cluster MY-CLUSTER --region us-west1 --project MY_PROJECT \
    --files my_keytab_file.keytab,my_jaas_file.conf \
    --properties spark.driver.extraJavaOptions=-Djava.security.auth.login.config=my_jaas_file.conf,spark.executor.extraJavaOptions=-Djava.security.auth.login.config=my_jaas_file.conf \
    gs://CODE_BUCKET/path/to/python/main.py 

The contents of my_jaas_file.conf:

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useKeyTab=true
  serviceName="kafka"
  keyTab="my_keytab_file.keytab"
  principal="principal@MY.COMPANY.COM";
};

Consumer code:

spark = SparkSession \
    .builder \
    .appName("MY_APP") \
    .master("yarn") \
    .getOrCreate()

df = spark.read.format("kafka") \
    .option("kafka.bootstrap.servers", "BOOTSTRAP_SERVERS_LIST[broker:port,broker:port,broker:port]") \
    .option("kafka.sasl.mechanism", "GSSAPI") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.group.id", "PREDEFINED_CG") \
    .option("subscribe", "MY_TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load() 

df.show()

The files get copied to GCS and, I think, from there they are copied into the YARN working directory. The JVM is able to pick them up and the authentication is successful.
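
As a side note, one way to confirm where those --files land is to resolve them through SparkFiles. This is a minimal sketch of mine, not part of the original job; it assumes the job was submitted with --files my_keytab_file.keytab and an existing SparkSession spark:

import os
from pyspark import SparkFiles

# Files shipped with --files are staged by Spark; SparkFiles.get() resolves
# their staged location. On YARN they are also linked into each container's
# working directory, which is why the relative keyTab path in the JAAS file works.
print(SparkFiles.get("my_keytab_file.keytab"))
print(os.path.exists("my_keytab_file.keytab"))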

However, this setup is not feasible, as I will not have access to the keytab file. The keytab will be part of a deployment process and will be available on the master and worker nodes, under a location on disk. A service will pick up the keytab file and maintain a cache file, which will become the source for Kerberized Kafka authentication.
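
For illustration only, here is a hypothetical sketch of what such a refresh service might do, using standard MIT Kerberos kinit flags; the paths and principal are the placeholders from above, not real values:

import subprocess

# Obtain/refresh a TGT from the deployed keytab (-kt) and write it to the
# cache file (-c) that the Kafka client will later read.
subprocess.run(
    ["kinit",
     "-kt", "/path/to/keytab/my_keytab_file.keytab",
     "-c", "/path/to/keytab/krb5_ccache",
     "principal@MY.COMPANY.COM"],
    check=True,
)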

I have tried creating a JAAS config file on the master and on each worker node:

nano /path/to/keytab/my_jaas_file.config
# variant 1
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useKeyTab=true
  serviceName="kafka"
  keyTab="/path/to/keytab/my_keytab_file.keytab"
  principal="principal@MY.COMPANY.COM";
};
# variant 2
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  debug=true
  useTicketCache=true
  ticketCache="/path/to/keytab/krb5_ccache"
  serviceName="kafka"
  principal="principal@MY.COMPANY.COM";
};
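
Before submitting, one way to sanity-check that the cache variant 2 relies on actually holds a valid ticket is to run klist against it. A minimal sketch, assuming MIT Kerberos tools are installed on the node:

import subprocess

# List the credentials in the cache file referenced by variant 2; check=True
# raises if klist exits non-zero, i.e. the cache is missing or unreadable.
subprocess.run(["klist", "-c", "/path/to/keytab/krb5_ccache"], check=True)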

And submit the Dataproc job with the following configuration:

gcloud dataproc jobs submit pyspark \
    --cluster MY-CLUSTER --region us-west1 --project MY_PROJECT \
    --properties spark.driver.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config,spark.executor.extraJavaOptions=-Djava.security.auth.login.config=file:///path/to/keytab/my_jaas_file.config \
    gs://CODE_BUCKET/path/to/python/main.py 

The JAAS configuration file is correctly picked up and read from disk by the Spark process; I verified this by intentionally deleting it from one node, after which the job failed with a "File not found" error. The keytab or ticketCache file, however, is not being picked up, and the following error is generated:

org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user
javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user

After digging through the Krb5LoginModule documentation, it seems that this is the default behavior:

When multiple mechanisms to retrieve a ticket or key is provided, the preference order is:

  1. ticket cache
  2. keytab
  3. shared state
  4. user prompt

For variant 1:

  1. It is able to pick up the settings from the JAAS file (the file:// reference works) stored on the local disk of each master / worker node.
  2. It searches for the keytab at /path/to/keytab/my_keytab_file.keytab (see the probe sketch after this list).
  3. It does not find it, so it checks whether a ticket cache is available.
  4. No ticket cache is available, so it falls back to the shared state.
  5. No login information is defined in the shared state.
  6. It asks for a username and password, which is not possible in the current context (a PySpark job).
  7. It throws the error: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
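
One way to verify step 2 from inside the job itself is to probe the path on every executor. This is a debugging sketch of mine, assuming an existing SparkSession spark; it checks both existence and readability for the OS user the executor runs as:

import os

KEYTAB = "/path/to/keytab/my_keytab_file.keytab"

def probe(_):
    # os.access checks readability as the OS user running the executor,
    # which is exactly what Krb5LoginModule needs to open the keytab.
    return (os.path.exists(KEYTAB), os.access(KEYTAB, os.R_OK))

print(spark.sparkContext.parallelize(range(16), 16).map(probe).distinct().collect())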

I have tried multiple ways of defining the keytab / ccache file in the JAAS config:

keyTab="file:/path/to/keytab/my_keytab_file.keytab"
keyTab="file:///path/to/keytab/my_keytab_file.keytab"
keyTab="local:/path/to/keytab/my_keytab_file.keytab"

But none of them seems to pick up the much-needed keytab file. (As far as I can tell, the Krb5LoginModule keyTab option expects a plain filesystem path, not a URL, unlike java.security.auth.login.config.)

There are a lot of things Spark and Dataproc do behind the scenes.

1 Answer

Managed to solve it!

It seems that the ccache / keytab files were not accessible to any other users.

sudo chmod 744 /path/to/keytab/my_jaas_file.config
sudo chmod 744 /path/to/keytab/krb5_ccache
sudo chmod 744 /path/to/keytab/my_keytab_file.keytab

The job runs as the root user on the driver, but on the executors it is not run as root; it probably uses the yarn or hadoop user.
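
To double-check this, here is a quick sketch (my addition, not part of the original fix) that reports which OS user the executors actually run as, assuming an existing SparkSession spark:

import getpass

users = (
    spark.sparkContext
    .parallelize(range(16), 16)
    .map(lambda _: getpass.getuser())   # OS user of the executor process
    .distinct()
    .collect()
)
print(users)  # e.g. ['yarn'] on the executors, while the driver may run as root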

Hope this helps other wandering souls!