
I'm trying to run a word-count example integrating an AWS Kinesis stream with Apache Spark. Random lines are put into Kinesis at regular intervals.

lines = KinesisUtils.createStream(...)

When I submit my application, lines.pprint() doesn't print any values.

When I print the lines object itself, I see <pyspark.streaming.dstream.TransformedDStream object at 0x7fa235724950>

How can I print the TransformedDStream object and check whether the data is being received?

I'm sure there is no credentials issue; if I use wrong credentials I get an access exception.

I've added the code for reference:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

if __name__ == "__main__":
    sc = SparkContext(appName="SparkKinesisApp")
    ssc = StreamingContext(sc, 1)

    lines = KinesisUtils.createStream(
        ssc, "SparkKinesisApp", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 2)

    # lines.saveAsTextFiles('/home/ubuntu/logs/out.txt')
    lines.pprint()

    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
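As a side note, the flatMap/map/reduceByKey chain itself can be sanity-checked outside Spark with plain Python on a handful of sample lines (a rough stand-in for one micro-batch; `sample_lines` here is made-up test data, not records from Kinesis):

```python
from collections import Counter

# Hypothetical stand-in for one micro-batch of stream records
sample_lines = ["hello world", "hello spark"]

# Mirrors flatMap(split) -> map((word, 1)) -> reduceByKey(+)
words = [w for line in sample_lines for w in line.split(" ")]
counts = Counter(words)

print(sorted(counts.items()))
# [('hello', 2), ('spark', 1), ('world', 1)]
```

If this produces the expected counts, the transformation logic is fine and the problem is with data actually arriving from the stream.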
ArunDhaJ

2 Answers


Finally, I got it working.

The example code I referred to at https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py gives a wrong command for submitting the application.

The correct command, with which I got it working, is:

$ bin/spark-submit --jars external/spark-streaming-kinesis-asl_2.11-2.1.0.jar --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 /home/ubuntu/my_pyspark/spark_kinesis.py
ArunDhaJ

Since lines.pprint() doesn't print anything, can you please confirm that you execute:

ssc.start()
ssc.awaitTermination()

as mentioned in the example here: https://github.com/apache/spark/blob/v2.1.0/examples/src/main/python/streaming/network_wordcount.py

pprint() should work when the environment is configured correctly:

http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#output-operations-on-dstreams

Output Operations on DStreams

print() – Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. (Python API: this is called pprint() in the Python API.)

Yaron
  • I've already tried the network word count program and `pprint` works for that, so I guess the environment is configured appropriately. Also, the two lines you mention are at the end of my code. The program runs until I press Ctrl+C. – ArunDhaJ Jan 30 '17 at 06:43
  • @ArunDhaJ – did you install a netcat server (http://landoflinux.com/linux_netcat_command.html) and run it using `$ nc -lk 9999`? Did you enter words in the netcat console, which would be the input to your Spark streaming program? – Yaron Jan 30 '17 at 12:44
  • I did try the network word count program with `nc` and executed it successfully. I'm only facing issues with the Amazon Kinesis integration. I'm publishing random sentences to the Kinesis stream, but my Spark client doesn't pick them up and process them. – ArunDhaJ Jan 31 '17 at 05:27