
I'm trying to write a simple PySpark job that receives data from a Kafka topic, applies some transformation to that data, and writes the transformed data to a different Kafka topic.

I have the following code, which reads data from a Kafka topic, but the sendkafka function appears to have no effect:

from pyspark import SparkConf, SparkContext

from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient

def sendkafka(messages):
    kafka = KafkaClient("localhost:9092")
    producer = SimpleProducer(kafka)
    for message in messages:
        yield producer.send_messages('spark.out', message)

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 5)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    parsed = kvs.map(lambda (key, value): json.loads(value))
    parsed.pprint()

    sentRDD = kvs.mapPartitions(sendkafka)
    sentRDD.count()

    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()

What should I change to make my sendkafka function actually send data to the spark.out Kafka topic?


1 Answer


Here is the correct code, which reads from Kafka into Spark and writes the Spark data back to a different Kafka topic. The key change is to publish from foreachRDD, which is an output operation; in your version, mapPartitions(sendkafka) is only a lazy transformation and the resulting DStream is never materialized, so the producer code never runs:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from kafka import KafkaProducer
import sys

# One producer, created on the driver.
producer = KafkaProducer(bootstrap_servers='localhost:9092')

def handler(message):
    # Pull each micro-batch back to the driver and publish every record.
    records = message.collect()
    for record in records:
        producer.send('spark.out', str(record))
    producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    # foreachRDD is an output operation, so handler runs for every batch.
    kvs.foreachRDD(handler)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

The way to run this is:

spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar s.py localhost:9092 test
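
If the assembly jar is not on hand, the Kafka integration can usually be pulled from Maven Central with --packages instead; the coordinates below assume Spark 1.6.1 built against Scala 2.10, so adjust them to your versions:

spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 s.py localhost:9092 test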
  • That gives the error below: Spark Streaming's Kafka libraries not found in class path. Try one of the following. 1. Include the Kafka library and its dependencies with in the spark-submit command as $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka:1.6.0 ... 2. Download the JAR of the artifact from Maven Central http://search.maven.org/, Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-assembly, Version = 1.6.0. – Beyhan Gul Dec 13 '16 at 14:17
  • @beyhan This answer only works in local mode, not on a cluster (see the sketch after this list). – avocado Sep 25 '17 at 13:56
  • @BeyhanGul You need to add --packages org.apache.spark:spark-streaming-kafka- to the command. – Chandan Feb 15 '18 at 06:13
  • @avocado I know it's been a long time, but I'm working on the same thing and I'm wondering why it wouldn't work in cluster mode, because I'm going to need a cluster later. Thank you. – Haytam Apr 19 '18 at 20:14
  • Hello @Gagan, sorry, but I didn't get the chance to try it in cluster mode. – Haytam Jun 08 '18 at 21:31
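
Regarding the cluster-mode question in the comments: the handler above collects every batch back to the driver and publishes through a single driver-side producer, which is why it only behaves well locally. A minimal sketch of a more cluster-friendly variant, assuming the same kafka-python KafkaProducer and spark.out topic (send_partition is a hypothetical helper name), creates the producer on the executors inside foreachPartition instead of calling collect():

from kafka import KafkaProducer

def send_partition(records):
    # Runs on the executor that owns this partition.
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    for record in records:
        producer.send('spark.out', str(record))
    producer.flush()
    producer.close()

def handler(rdd):
    # No collect(): each partition is published from where it lives.
    rdd.foreachPartition(send_partition)

This avoids shipping all records through the driver, at the cost of opening a new producer per partition per batch.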