
I'm trying to run a word-count example integrating an AWS Kinesis stream with Apache Spark. Random lines are put into Kinesis at regular intervals.

lines = KinesisUtils.createStream(...)

When I submit my application, lines.pprint() doesn't print any values.

When I print the lines object itself, I see <pyspark.streaming.dstream.TransformedDStream object at 0x7fa235724950>

How can I print the TransformedDStream object and check whether the data is being received?

I'm sure there is no credentials issue; if I use wrong credentials I get an access exception.

I've added the code for reference:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

if __name__ == "__main__":
    sc = SparkContext(appName="SparkKinesisApp")
    ssc = StreamingContext(sc, 1)

    lines = KinesisUtils.createStream(
        ssc, "SparkKinesisApp", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 2)

    # lines.saveAsTextFiles('/home/ubuntu/logs/out.txt')
    lines.pprint()

    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
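As a side note, the flatMap/map/reduceByKey chain itself can be sanity-checked outside Spark with plain Python on a handful of sample lines (a rough stand-in for one micro-batch; `sample_lines` here is made-up test data, not records from Kinesis):

```python
from collections import Counter

# Hypothetical stand-in for one micro-batch of stream records
sample_lines = ["hello world", "hello spark"]

# Mirrors flatMap(split) -> map((word, 1)) -> reduceByKey(+)
words = [w for line in sample_lines for w in line.split(" ")]
counts = Counter(words)

print(sorted(counts.items()))
# [('hello', 2), ('spark', 1), ('world', 1)]
```

If this produces the expected counts, the transformation logic is fine and the problem is with data actually arriving from the stream.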
ArunDhaJ

2 Answers


Finally, I got it working.

The example code I referred to at https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py gives a wrong command for submitting the application.

The correct command, with which I got it working, is:

$ bin/spark-submit --jars external/spark-streaming-kinesis-asl_2.11-2.1.0.jar --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 /home/ubuntu/my_pyspark/spark_kinesis.py
ArunDhaJ

Since lines.pprint() doesn't print anything, can you please confirm that you execute:

ssc.start()
ssc.awaitTermination()

as mentioned in the example here: https://github.com/apache/spark/blob/v2.1.0/examples/src/main/python/streaming/network_wordcount.py

pprint() should work when the environment is configured correctly:

http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#output-operations-on-dstreams

Output Operations on DStreams

print() – Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. (Python API: this is called pprint() in the Python API.)

Yaron
  • I've already tried the network word count program and `pprint` works for that, so I guess the environment is configured appropriately. Also, the two lines you mention are at the end of my code. The program runs until I press Ctrl+C. – ArunDhaJ Jan 30 '17 at 06:43
  • @ArunDhaJ – did you install a netcat server (http://landoflinux.com/linux_netcat_command.html) and run it using `$ nc -lk 9999`? Did you enter words in the netcat console, which would be the input to your Spark streaming program? – Yaron Jan 30 '17 at 12:44
  • I did try the network word count program with `nc` and executed it successfully. I'm only facing issues with the Amazon Kinesis integration. I'm publishing random sentences to the Kinesis stream, but my Spark client doesn't pick them up and process them. – ArunDhaJ Jan 31 '17 at 05:27