
I am working on a simple Python script that streams messages from Kafka using pyspark, running inside a Jupyter notebook.

I get an error saying "Spark Streaming's Kafka libraries not found in class path" (full message below). I already applied the fix suggested by @tshilidzi-mudau in a previous post (and confirmed in the docs), but the error persists. What should I do to fix it?

Following the suggestion in the error message, I downloaded the JAR of the artifact, stored it in $SPARK_HOME/jars and referenced it in the code.
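
For reference, the --packages route from option 1 in the error message would look roughly like this inside a notebook; the Maven coordinates below are an assumption (Scala 2.11 build of the 0-8 integration for Spark 2.2.2) and have to match the local Spark installation, and the variable must be set before the first SparkContext is created:

import os

# Assumed coordinates: the Kafka 0.8 integration used by pyspark.streaming.kafka,
# built for Scala 2.11 and Spark 2.2.2. Adjust both versions to whatever the
# local Spark distribution actually uses.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 "
    "pyspark-shell"
)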

Here is the code:

from __future__ import print_function  # __future__ imports must come before any other import

import os
import sys
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-0-10-assembly_2.10-2.2.2.jar pyspark-shell'  # note: the trailing "pyspark-shell" part is essential

    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")

    try:
        sc.stop()  # stop any SparkContext left over from a previous notebook cell
    except NameError:
        pass

    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 1)
    map1 = {'spark-kafka': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1)  # createStream expects the ZooKeeper quorum here; tried with localhost:2181 too

    print("kafkastream=",kafkaStream)
    sc.stop()
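
In case it helps, this is the kind of sanity check I can run in the same notebook (before the final sc.stop()) to see which Spark and Scala versions the assembly jar has to match, and whether the jar was actually registered; the SPARK_HOME environment variable and the jar naming pattern below are assumptions about my local installation:

import os

print("Spark version:", sc.version)

# The Spark distribution ships its Scala runtime as scala-library-<version>.jar,
# so the filename shows which Scala build (2.10 vs 2.11) the assembly must match.
jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
print([j for j in os.listdir(jars_dir) if j.startswith("scala-library")])

# Jars passed via --jars should appear under the spark.jars configuration entry.
print("spark.jars =", sc.getConf().get("spark.jars", "<not set>"))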

And this is the error:

  Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.2 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.2.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...

TypeError                                 Traceback (most recent call last)
<ipython-input-9-34de7dbdfc7c> in <module>()
     13 ssc = StreamingContext(sc,1)
     14 broker = "<my_broker_ip>"
---> 15 directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
     16 directKafkaStream.pprint()
     17 ssc.start()

/opt/spark/python/pyspark/streaming/kafka.pyc in createDirectStream(ssc, topics, kafkaParams, fromOffsets, keyDecoder, valueDecoder, messageHandler)
    120             return messageHandler(m)
    121 
--> 122         helper = KafkaUtils._get_helper(ssc._sc)
    123 
    124         jfromOffsets = dict([(k._jTopicAndPartition(helper),

/opt/spark/python/pyspark/streaming/kafka.pyc in _get_helper(sc)
    193     def _get_helper(sc):
    194         try:
--> 195             return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
    196         except TypeError as e:
    197             if str(e) == "'JavaPackage' object is not callable":

TypeError: 'JavaPackage' object is not callable
  • What makes you think that you are using Spark 2.2 with Scala 2.10? That's a highly unusual and unlikely configuration. Most likely it should be `spark-streaming-kafka-0-10-assembly_2.11-2.2.2.jar` – zero323 Nov 18 '18 at 15:08
  • In addition to that, I'd suggest using Structured Streaming instead (a sketch follows these comments) – OneCricketeer Nov 18 '18 at 16:13
  • FWIW, your code looks roughly similar to this one: https://stackoverflow.com/q/53296850/2308683 – OneCricketeer Nov 18 '18 at 16:32
  • Possible duplicate of [Resolving dependency problems in Apache Spark](https://stackoverflow.com/questions/41383460/resolving-dependency-problems-in-apache-spark) – zero323 Nov 18 '18 at 17:41
  • Thanks everybody for the suggestions. In the end I gave up and switched to [this](https://github.com/Yannael/kafka-sparkstreaming-cassandra) Docker setup. Dear reader, if you have similar problems, install Docker and forget about all the mess. – albus_c Nov 18 '18 at 17:43
  • I would generally avoid stuffing Kafka, Spark, and Cassandra all in one container. Kafka and Cassandra can run in Docker, sure, but definitely separate, and Spark should run fine externally of those to read/write data (assuming Python and Java are set up for doing that) – OneCricketeer Nov 20 '18 at 19:56
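
Re the Structured Streaming suggestion above, a minimal sketch of what that could look like for the same topic. The broker address, topic name, and the spark-sql-kafka package coordinates below are assumptions and need to match the local Kafka setup and Spark/Scala versions:

import os
from pyspark.sql import SparkSession

# Assumed coordinates for the Structured Streaming Kafka source; adjust the
# Scala (2.11) and Spark (2.2.2) versions to the installed distribution.
# Must be set before the SparkSession (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.2 pyspark-shell"
)

spark = SparkSession.builder.appName("Kafka-Structured").getOrCreate()

# Subscribe to the same (assumed) topic and broker used in the question.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "spark-kafka")
      .load())

# Kafka keys/values arrive as binary; cast them to strings and dump to the console.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()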

0 Answers