I am trying to integrate Spark and Kafka in a Jupyter notebook using PySpark. Here is my work environment.

Spark version: Spark 2.2.1
Kafka version: Kafka_2.11-0.8.2.2
Spark Streaming Kafka jar: spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar

I added the Spark Streaming Kafka assembly jar to my spark-defaults.conf file.
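For reference, the relevant line in spark-defaults.conf looks like the following (the path here is illustrative; it should point at wherever the assembly jar actually lives):

    spark.jars /path/to/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar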

When I start the StreamingContext for PySpark streaming, an error appears saying the Kafka version can't be read from MANIFEST.MF.

Here is my code.

from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import sys
import os

from kafka import KafkaProducer

#Receive data handler
def handler(message):
    records = message.collect()
    for record in records:
        print(record)
        #producer.send('receive', str(record))
        #producer.flush()

producer = KafkaProducer(bootstrap_servers='slave02:9092')
sc = SparkContext(appName="SparkwithKafka")
ssc = StreamingContext(sc, 1)

#Create Kafka stream (Zookeeper quorum and topic are hardcoded here)
zkQuorum = 'slave02:2181'
topic = 'send'
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic:1})
kvs.foreachRDD(handler)

ssc.start()
  • What is your command for submitting the code? Or how are you loading the JARs into Jupyter? – OneCricketeer Aug 08 '18 at 13:43
  • 1
    EDIT : My bad, doesn't looks like an api for python Note : the KafkaUtils.createStream is the old way to read a kafka topic. You should use [Kafka 0.10 api](https://spark.apache.org/docs/2.2.1/streaming-kafka-0-10-integration.html) – Bameza Aug 08 '18 at 15:16
  • @cricket_007 ssc.start() starts the job. I tested it in a Jupyter notebook. I appended the jar path to spark-defaults.conf, for example spark.jars spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar – Daniel Lee Aug 08 '18 at 18:12
  • @Bameza But as far as I know, Spark is not compatible with version 0.10.0 ([link](https://spark.apache.org/docs/2.2.1/streaming-kafka-integration.html)), which is why I use the Kafka 0.8 API. Is it possible to use the Kafka 0.10 API in a Jupyter notebook? – Daniel Lee Aug 08 '18 at 18:16
  • 1
    I figured out this problem. This is a warning and my application won't crash because of it. It works well for me. Thanks everyone! – Daniel Lee Aug 08 '18 at 18:44
  • Spark definitely is compatible with Kafka 0.10 libraries. I think that "experimental" note on that page is outdated. – OneCricketeer Aug 08 '18 at 19:01

1 Answer

Sorry for posting this in Scala.

Spark 2.2.1 with Scala 2.11 and Kafka 0.10 all work together, even though the integration is marked as experimental.

The proper way to create a stream with the above libraries is:

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
val kStream = KafkaUtils.createDirectStream(
  ssc, PreferConsistent,
  Subscribe[String, String](Array("weblogs-text"), kafkaParams, fromOffsets))
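For completeness, here is a minimal sketch of the two values the snippet above assumes. The broker address and group id are taken from the question; the partition and offset are placeholders, and fromOffsets can be dropped entirely (Subscribe has an overload without it) to start from the consumer's committed offsets:

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

// Standard Kafka consumer settings for the 0.10 direct stream
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "slave02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-streaming-consumer",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

// Explicit starting offsets: begin partition 0 of the topic at offset 0
val fromOffsets = Map(new TopicPartition("weblogs-text", 0) -> 0L)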

Pay attention to the dependencies: the Kafka integration jars are specific to both the Kafka client version and the Spark version. For example:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.2.1</version>
        <scope>provided</scope>
    </dependency>
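Note that with the provided scope the jar is not bundled into your application, so it has to be supplied at runtime, for example via spark-submit --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.1.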