I have a Kafka 2.3 message broker and want to do some processing on the message data in Spark. To begin with, I want to use the Spark 2.4.0 that is integrated in Zeppelin 0.8.1 and use the Zeppelin notebooks for rapid prototyping.
For this streaming task I need "spark-streaming-kafka-0-10" for Spark > 2.3, according to https://spark.apache.org/docs/latest/streaming-kafka-integration.html, which only supports Java and Scala (and not Python). But there are no default Java or Scala interpreters in Zeppelin.
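As far as I understand the integration guide, the equivalent Scala code for the 0-10 direct stream would look roughly like this (an untested sketch on my part; broker address, topic and group id are the same values as in my Python attempt below, and I assume an existing SparkContext sc as Zeppelin provides it):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// 60 second batch interval on the already available SparkContext
val ssc = new StreamingContext(sc, Seconds(60))

// consumer settings for the new (0.10+) Kafka consumer API
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming",
  "auto.offset.reset" -> "latest"
)

// direct stream (no receiver): one RDD partition per Kafka partition
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("test"), kafkaParams)
)

stream.map(record => record.value).print()
ssc.start()

So the question is mainly where such code could run inside Zeppelin.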
If I try this code (taken from https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/):
%spark.pyspark
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc.setLogLevel("WARN")
# 60 second batch interval on the SparkContext provided by Zeppelin
ssc = StreamingContext(sc, 60)
# receiver-based stream from the 0-8 API: createStream(ssc, zkQuorum, groupId, {topic: numPartitions})
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
I get the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/, Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.0. Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
Fail to execute line 1: kafkaStream = KafkaUtils.createStream(ssc, 'localhost:9092', 'spark-streaming', {'test':1})
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-8982542851842620568.py", line 380, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 1, in <module>
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 78, in createStream
    helper = KafkaUtils._get_helper(ssc._sc)
  File "/usr/local/analyse/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 217, in _get_helper
    return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
TypeError: 'JavaPackage' object is not callable
So I wonder how to tackle the task:
- Should I really use spark-streaming-kafka-0-8 despite it having been deprecated for some months (see the sketch after this list)? But spark-streaming-kafka-0-10 already seems to be in the default Zeppelin jar directory.
- Create/configure a Java/Scala interpreter in Zeppelin, since spark-streaming-kafka-0-10 only supports these languages?
- Ignore Zeppelin and do it on the console using "spark-submit"?
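For the first option, I guess the missing 0-8 library could be pulled into Zeppelin's Spark interpreter with the dynamic dependency loader, roughly like this (an untested sketch; the Scala suffix _2.11 is my assumption for the Spark 2.4.0 build used here):

%spark.dep
z.reset()
// fetch the deprecated 0-8 integration that backs KafkaUtils.createStream in PySpark
z.load("org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0")

As far as I know, such a paragraph would have to run before the first %spark.pyspark paragraph so that the jar ends up on the interpreter's classpath. But I am not sure whether that is the recommended approach.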