I am facing the following error while submitting a Python job in cluster mode:

appcache/application_1548793257188_803870/container_e80_1548793257188_803870_01_000001/environment/lib/python2.7/site-packages/confluent_kafka/__init__.py", line 2, in <module>
    from .cimpl import (Consumer,  # noqa
ImportError: librdkafka.so.1: cannot open shared object file: No such file or directory

librdkafka and the other Python dependencies are installed ONLY on an edge node. Before submitting, I create a virtual environment and pip install confluent-kafka like this:

pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-binary :all: confluent-kafka
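For reference, a minimal sketch of the full build-and-package step, assuming virtualenv is available on the edge node (the environment name and relative paths are from my setup):

# Create and populate the virtual environment on the edge node
virtualenv environment
source environment/bin/activate

# Build confluent-kafka from source so it links against the local librdkafka
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-binary :all: confluent-kafka
deactivate

# Package the environment contents so YARN can unpack them under environment/
tar -czf environment.tar.gz -C environment .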

After that, I create environment.tar.gz and pass it to spark-submit with --archives.
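The submit command looks roughly like this (the script name and the #environment alias are placeholders for my actual values):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
  my_job.py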

I have tried to set Spark properties like this:

--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib64:environment/lib/python2.7/site-packages/confluent_kafka/.libs
--conf spark.driver.extraLibraryPath=/usr/lib64:environment/lib/python2.7/site-packages/confluent_kafka/.libs
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=environment/lib/python2.7/site-packages/confluent_kafka/.libs

But unfortunately it didn't work!

Has anybody faced the same problem?

  • Spark has its own Kafka libraries (written in Java, and it uses the Java Kerberos API). It's not clear why you are trying to add the `confluent-kafka` Python library into that – OneCricketeer Mar 08 '19 at 17:30
  • https://github.com/markgrover/spark-secure-kafka-app#spark-secure-kafka-app – OneCricketeer Mar 08 '19 at 17:33
  • The problem is that the job is written in Python and the consumer uses confluent-kafka. And it's not possible to change the implementation. – Elisabetta Mar 10 '19 at 15:36
  • I don't follow... the Spark Streaming Kafka implementation is written in Scala, but you're welcome to use PySpark to directly call that. You can't include the confluent-kafka Python code within Spark, and you need to use the Spark Context, not just a regular Python script with any library. https://medium.com/@mukeshkumar_46704/getting-streaming-data-from-kafka-with-spark-streaming-using-python-9cd0922fa904 – OneCricketeer Mar 14 '19 at 23:37
  • 1
    Indeed I am using spark context within my python job. Any way, the problem was solved installing the librdkafka on each worker node of the cluster. – Elisabetta Mar 19 '19 at 08:27
  • That's still the incorrect way to use Spark with Kafka... – OneCricketeer Mar 21 '19 at 18:43
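
For anyone landing here: a minimal sketch of the fix mentioned in the comments, run on every worker node. The package name and manager depend on the distribution; this assumes a RHEL/CentOS node with a repository that provides librdkafka (e.g. EPEL):

# Install the librdkafka shared library on each worker node
# (package name is distribution-dependent; on Debian/Ubuntu it is librdkafka1)
sudo yum install -y librdkafka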
