I have a Kafka 2.3 message broker and want to do some processing on the message data in Spark. To begin with, I want to use the Spark 2.4.0 that is integrated in Zeppelin 0.8.1 and use the Zeppelin notebooks for rapid prototyping.
For this streaming task I need "spark-streaming-kafka-0-10" for Spark > 2.3, according to https://spark.apache.org/docs/latest/streaming-kafka-integration.html, which only supports Java and Scala (and not Python). But there are no default Java or Scala interpreters in Zeppelin.
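As far as I understand the integration guide, the equivalent Scala code for the 0-10 direct stream would look roughly like this (an untested sketch on my part; broker address, topic and group id are the same values as in my Python attempt below, and I assume an existing SparkContext sc as Zeppelin provides it):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// 60 second batch interval on the already available SparkContext
val ssc = new StreamingContext(sc, Seconds(60))

// consumer settings for the new (0.10+) Kafka consumer API
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming",
  "auto.offset.reset" -> "latest"
)

// direct stream (no receiver): one RDD partition per Kafka partition
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("test"), kafkaParams)
)

stream.map(record => record.value).print()
ssc.start()

So the question is mainly where such code could run inside Zeppelin.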
If I try this code (taken from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/):
%spark.pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.setLogLevel("WARN")
# 60 second batch interval on the SparkContext provided by Zeppelin
ssc = StreamingContext(sc, 60)
# receiver-based stream from the 0-8 API: createStream(ssc, zkQuorum, groupId, {topic: numPartitions})
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
I get the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/, Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0. Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Fail to execute line 1: kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8982542851842620568.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
    helper = KafkaUtils._get_helper(ssc._sc)
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
    return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
So I wonder how to tackle the task:
- Should I really use spark-streaming-kafka-0-8 despite it having been deprecated for some months (see the sketch after this list)? But spark-streaming-kafka-0-10 already seems to be in the default Zeppelin jar directory.
- Create/configure a Java/Scala interpreter in Zeppelin, since spark-streaming-kafka-0-10 only supports these languages?
- Ignore Zeppelin and do it on the console using "spark-submit"?
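For the first option, I guess the missing 0-8 library could be pulled into Zeppelin's Spark interpreter with the dynamic dependency loader, roughly like this (an untested sketch; the Scala suffix _2.11 is my assumption for the Spark 2.4.0 build used here):

%spark.dep
z.reset()
// fetch the deprecated 0-8 integration that backs KafkaUtils.createStream in PySpark
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0")

As far as I know, such a paragraph would have to run before the first %spark.pyspark paragraph so that the jar ends up on the interpreter's classpath. But I am not sure whether that is the recommended approach.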