
I have the following pyspark script, which is supposed to connect to a local Kafka cluster:

from pyspark import SparkConf, SparkContext

from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"
##OTHER FUNCTIONS/CLASSES

def main():
    sc = SparkContext(appName=APP_NAME)
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()
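As an aside, the per-batch result the `flatMap`/`map`/`reduceByKey` chain computes is an ordinary word count; it can be sketched in plain Python, independent of Spark (the sample batch below is hypothetical):

```python
# Plain-Python sketch of what the DStream pipeline computes for one micro-batch.
from collections import defaultdict

def word_count(lines):
    counts = defaultdict(int)
    for line in lines:                # flatMap: split each line into words
        for word in line.split(" "):
            counts[word] += 1         # map + reduceByKey: sum 1s per word
    return dict(counts)

batch = ["hello kafka", "hello spark"]  # hypothetical batch contents
print(word_count(batch))                # {'hello': 2, 'kafka': 1, 'spark': 1}
```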

When I run this, I get the following error:

File "/home/ubuntu/spark-1.3.0-bin-hadoop2.4/hello1.py", line 16, in main
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
AttributeError: type object 'KafkaUtils' has no attribute 'createDirectStream'

What should I do in order to have access to KafkaUtils.createDirectStream ?

Eugene Goldberg

1 Answer


You're using Spark 1.3.0, and the Python version of `createDirectStream` was introduced in Spark 1.4.0. Spark 1.3 provides only the Scala and Java implementations.

If you want to use the direct stream, you'll have to upgrade your Spark installation.
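If upgrading isn't immediately possible, the Spark 1.3 Python API does include the older receiver-based `KafkaUtils.createStream`, which connects through ZooKeeper rather than a broker list. A minimal sketch, assuming a local ZooKeeper at `localhost:2181`; the consumer group name `wordcount-group` and topic `my-topic` are placeholders:

```python
# Receiver-based Kafka stream (available in the Spark 1.3 Python API).
# "localhost:2181" (ZooKeeper), "wordcount-group", and "my-topic" are
# placeholder values for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 2)

# The topics argument maps each topic name to a number of receiver threads.
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "wordcount-group", {"my-topic": 1})
counts = kvs.map(lambda x: x[1]) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Note that the receiver-based approach has different delivery semantics than the direct stream (it tracks offsets in ZooKeeper and may need a write-ahead log for reliability), so upgrading remains the better long-term fix.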

zero323