I am working on a simple Python script to stream messages from Kafka with PySpark, running inside a Jupyter notebook. I get an error saying "Spark Streaming's Kafka libraries not found in class path" (full output below), even though I already applied the fix suggested by @tshilidzi-mudau in a previous post (and confirmed in the docs): following the error message's own suggestion, I downloaded the JAR of the artifact, stored it in $SPARK_HOME/jars, and included the reference in the code. What should I do to fix the bug?
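As far as I understand, PYSPARK_SUBMIT_ARGS is only read when the JVM gateway is launched, which happens the first time a SparkContext is created, so the variable has to be set before that point. A minimal sketch of the ordering I am relying on (the jar name is the one I downloaded):

import os

# Assumption: no SparkContext exists yet in this kernel; an already-running
# JVM would never pick up the --jars argument.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars spark-streaming-kafka-0-10-assembly_2.10-2.2.2.jar pyspark-shell'
)

from pyspark import SparkContext
sc = SparkContext(appName="Kafka-Spark")  # the JVM launches here and reads the args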
Here is the code:
from __future__ import print_function  # a __future__ import must come first
import os
import sys

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    # note that the "pyspark-shell" part is very important!
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-0-10-assembly_2.10-2.2.2.jar pyspark-shell'
    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")
    try:
        sc.stop()  # stop any context left over from a previous cell
    except:
        pass
    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 1)
    map1 = {'spark-kafka': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1)  # tried with localhost:2181 too
    print("kafkastream=", kafkaStream)
    sc.stop()
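For reference, here is the same createStream call with keyword arguments; per the PySpark 0-8 integration API the second parameter is the ZooKeeper quorum (typically host:2181), not the Kafka broker list, which is why I also tried localhost:2181:

kafkaStream = KafkaUtils.createStream(
    ssc=stream,
    zkQuorum='localhost:2181',  # ZooKeeper quorum, not the Kafka broker (9092)
    groupId='name',
    topics={'spark-kafka': 1},
)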
And this is the error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.2 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.2.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
TypeError Traceback (most recent call last)
<ipython-input-9-34de7dbdfc7c> in <module>()
13 ssc = StreamingContext(sc,1)
14 broker = "<my_broker_ip>"
---> 15 directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
16 directKafkaStream.pprint()
17 ssc.start()
/opt/spark/python/pyspark/streaming/kafka.pyc in createDirectStream(ssc, topics, kafkaParams, fromOffsets, keyDecoder, valueDecoder, messageHandler)
120 return messageHandler(m)
121
--> 122 helper = KafkaUtils._get_helper(ssc._sc)
123
124 jfromOffsets = dict([(k._jTopicAndPartition(helper),
/opt/spark/python/pyspark/streaming/kafka.pyc in _get_helper(sc)
193 def _get_helper(sc):
194 try:
--> 195 return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
196 except TypeError as e:
197 if str(e) == "'JavaPackage' object is not callable":
TypeError: 'JavaPackage' object is not callable
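From the traceback, the call that fails is sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper(): when the class is not on the JVM class path, py4j resolves the name to a plain JavaPackage, and calling that raises exactly this TypeError. A quick check I can run in the notebook (a sketch, assuming the notebook's sc is alive):

# Resolves to a py4j JavaClass if the assembly jar is on the class path,
# and to a plain JavaPackage (the failure case above) if it is not.
helper = sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
print(type(helper))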