
I have the following pyspark script, which is supposed to connect to a local Kafka cluster:

from pyspark import SparkConf, SparkContext

from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
## Constants
APP_NAME = "PythonStreamingDirectKafkaWordCount"
##OTHER FUNCTIONS/CLASSES

def main():
    sc = SparkContext(appName=APP_NAME)
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
if __name__ == "__main__":
    main()
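As an aside, the per-batch result the `flatMap`/`map`/`reduceByKey` chain computes is an ordinary word count; it can be sketched in plain Python, independent of Spark (the sample batch below is hypothetical):

```python
# Plain-Python sketch of what the DStream pipeline computes for one micro-batch.
from collections import defaultdict

def word_count(lines):
    counts = defaultdict(int)
    for line in lines:                # flatMap: split each line into words
        for word in line.split(" "):
            counts[word] += 1         # map + reduceByKey: sum 1s per word
    return dict(counts)

batch = ["hello kafka", "hello spark"]  # hypothetical batch contents
print(word_count(batch))                # {'hello': 2, 'kafka': 1, 'spark': 1}
```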

When I run this, I get the following error:

File "/home/ubuntu/spark-1.3.0-bin-hadoop2.4/hello1.py", line 16, in main
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
AttributeError: type object 'KafkaUtils' has no attribute 'createDirectStream'

What should I do in order to have access to KafkaUtils.createDirectStream ?

Eugene Goldberg

1 Answer


You're using Spark 1.3.0, and the Python version of `createDirectStream` was introduced in Spark 1.4.0. Spark 1.3 provides only the Scala and Java implementations.

If you want to use the direct stream, you'll have to upgrade your Spark installation.
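If upgrading isn't immediately possible, the Spark 1.3 Python API does include the older receiver-based `KafkaUtils.createStream`, which connects through ZooKeeper rather than a broker list. A minimal sketch, assuming a local ZooKeeper at `localhost:2181`; the consumer group name `wordcount-group` and topic `my-topic` are placeholders:

```python
# Receiver-based Kafka stream (available in the Spark 1.3 Python API).
# "localhost:2181" (ZooKeeper), "wordcount-group", and "my-topic" are
# placeholder values for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonStreamingKafkaWordCount")
ssc = StreamingContext(sc, 2)

# The topics argument maps each topic name to a number of receiver threads.
kvs = KafkaUtils.createStream(ssc, "localhost:2181", "wordcount-group", {"my-topic": 1})
counts = kvs.map(lambda x: x[1]) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Note that the receiver-based approach has different delivery semantics than the direct stream (it tracks offsets in ZooKeeper and may need a write-ahead log for reliability), so upgrading remains the better long-term fix.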

zero323