I am facing the following error while submitting a Python job in cluster mode:

appcache/application_1548793257188_803870/container_e80_1548793257188_803870_01_000001/environment/lib/python2.7/site-packages/confluent_kafka/__init__.py", line 2, in <module>
    from .cimpl import (Consumer,  # noqa
ImportError: librdkafka.so.1: cannot open shared object file: No such file or directory

librdkafka and the other Python dependencies are installed ONLY on an edge node. Before submitting, I create a virtual environment and pip install confluent-kafka like this:

pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-binary :all: confluent-kafka
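For reference, a minimal sketch of the full build-and-package step, assuming virtualenv is available on the edge node (the environment name and relative paths are from my setup):

# Create and populate the virtual environment on the edge node
virtualenv environment
source environment/bin/activate

# Build confluent-kafka from source so it links against the local librdkafka
pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org --no-binary :all: confluent-kafka
deactivate

# Package the environment contents so YARN can unpack them under environment/
tar -czf environment.tar.gz -C environment .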

After that, I create environment.tar.gz and pass it to spark-submit with --archives.
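The submit command looks roughly like this (the script name and the #environment alias are placeholders for my actual values):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
  my_job.py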

I have tried to set Spark properties like this:

--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib64:environment/lib/python2.7/site-packages/confluent_kafka/.libs
--conf spark.driver.extraLibraryPath=/usr/lib64:environment/lib/python2.7/site-packages/confluent_kafka/.libs
--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=environment/lib/python2.7/site-packages/confluent_kafka/.libs

But unfortunately it didn't work!

Has anybody faced the same problem?

  • Spark has its own Kafka libraries (written in Java, and it uses the Java Kerberos API). It's not clear why you are trying to add the `confluent-kafka` Python library into that – OneCricketeer Mar 08 '19 at 17:30
  • https://github.com/markgrover/spark-secure-kafka-app#spark-secure-kafka-app – OneCricketeer Mar 08 '19 at 17:33
  • The problem is that the job is written in Python and the consumer uses confluent-kafka. And it's not possible to change the implementation. – Elisabetta Mar 10 '19 at 15:36
  • I don't follow... the Spark Streaming Kafka implementation is written in Scala, but you're welcome to use PySpark to directly call that. You can't include the confluent-kafka Python code within Spark, and you need to use the Spark Context, not just a regular Python script with any library. https://medium.com/@mukeshkumar_46704/getting-streaming-data-from-kafka-with-spark-streaming-using-python-9cd0922fa904 – OneCricketeer Mar 14 '19 at 23:37
  • 1
    Indeed I am using spark context within my python job. Any way, the problem was solved installing the librdkafka on each worker node of the cluster. – Elisabetta Mar 19 '19 at 08:27
  • That's still the incorrect way to use Spark with Kafka... – OneCricketeer Mar 21 '19 at 18:43
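
For anyone landing here: a minimal sketch of the fix mentioned in the comments, run on every worker node. The package name and manager depend on the distribution; this assumes a RHEL/CentOS node with a repository that provides librdkafka (e.g. EPEL):

# Install the librdkafka shared library on each worker node
# (package name is distribution-dependent; on Debian/Ubuntu it is librdkafka1)
sudo yum install -y librdkafka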
