
So I have some data that I'm streaming into a Kafka topic. I'm taking this streaming data and placing it into a DataFrame, and I want to display the data inside the DataFrame:

import os
from kafka import KafkaProducer
from pyspark.sql import SparkSession, DataFrame
import time
from datetime import datetime, timedelta

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'

topic_name = "my-topic"
kafka_broker = "localhost:9092"

producer = KafkaProducer(bootstrap_servers = kafka_broker)
spark = SparkSession.builder.getOrCreate()
terminate = datetime.now() + timedelta(seconds=30)

while datetime.now() < terminate:
    producer.send(topic = topic_name, value = str(datetime.now()).encode('utf-8'))
    time.sleep(1)

readDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", topic_name) \
    .load()
readDF = readDF.selectExpr("CAST(key AS STRING)","CAST(value AS STRING)")

readDF.writeStream.format("console").start()
readDF.show()

producer.close()

However, I keep getting this error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/spark/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o30.showString.
: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
...
Traceback (most recent call last):
  File "test2.py", line 30, in <module>
    readDF.show()
  File "/home/spark/spark/python/pyspark/sql/dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "/home/spark/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/spark/spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkafka'

I don't understand why the exception is happening; I'm calling writeStream.start() right before show(). I tried getting rid of selectExpr(), but that made no difference. Does anyone know how to display a stream-sourced DataFrame? I'm using Python 3.6.1, Kafka 0.10.2.1, and Spark 2.2.0.

user2361174

3 Answers


A streaming DataFrame doesn't support the show() method. When you call the start() method, it starts a background thread that streams the input data to the sink, and since you are using ConsoleSink, it will output the data to the console. You don't need to call show().

Remove readDF.show() and add a sleep after start(); then you should be able to see data in the console, such as:

query = readDF.writeStream.format("console").start()
import time
time.sleep(10) # sleep 10 seconds
query.stop()

You also need to set startingOffsets to earliest; otherwise, the Kafka source will just start from the latest offset and fetch nothing in your case.

readDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("startingOffsets", "earliest") \
    .option("subscribe", topic_name) \
    .load()
zsxwing
  • The first time I ran it, I got `java.lang.InterruptedException` after the table was displayed, along with `WARN Shell: Interrupted while joining on: Thread[Thread-80,5,main]`, although when I ran it again it didn't appear. Any idea what caused this? – user2361174 Jul 13 '17 at 23:59
  • It should be just a warning. When `query.stop()` is called, it sends an interrupt signal to the stream thread, which may throw InterruptedException. It should not fail your code. – zsxwing Jul 14 '17 at 00:27
  • I see, one last question. I've noticed that the query prints out even the rows that were in the Kafka topic before I wrote to it. That isn't a problem, but because of it I'm trying to get the number of rows in my DataFrame. I tried following this answer by doing `q2 = readDF.count().writeStream.format("console").start()`, but I get the same AnalysisException as before. Any idea how to do this? – user2361174 Jul 14 '17 at 01:11
  • Actually, I think I got it. I made another DataFrame similar to readDF, except I replaced `readStream` with `read`. That allows me to use the `count()` method (see the sketch below). – user2361174 Jul 14 '17 at 02:16
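
A minimal sketch of that batch-read approach, reusing kafka_broker and topic_name from the question (the variable name batch_df is just illustrative):

# Batch read (spark.read rather than readStream) of the same topic;
# the resulting batch DataFrame supports count() and other batch actions
batch_df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_broker) \
    .option("subscribe", topic_name) \
    .load()

print(batch_df.count())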

A streaming DataFrame doesn't support the show() method directly, but there is a way to see your data: let the background thread sleep for a few moments, then call show() on the temp table created by the memory sink. I can help with the PySpark way of using the show() method.

Refer to my answer here
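
A minimal sketch of that memory-sink approach, reusing readDF from the question (the query name kafka_preview is an arbitrary choice):

import time

query = readDF.writeStream \
    .format("memory") \
    .queryName("kafka_preview") \
    .start()

time.sleep(10)  # give the background thread time to pull some records

# The memory sink registers an in-memory table named after the query,
# which can be queried like an ordinary batch table
spark.sql("SELECT * FROM kafka_preview").show()
query.stop()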

RainaMegha
  • If the answer to a question is a link to another answer, that may be an indication that the question should be marked as a duplicate. – Jason Aller Jun 02 '20 at 22:27

Since your input data is a stream, your output data is also a stream. This means you can't use readDF.show(); otherwise you'll get an error, as you've seen. You're most of the way there. The start() function returns a StreamingQuery instance, which will display the data, but in order to see the data you need to wait; otherwise your code will continue and complete before displaying anything. You just need to update your code to capture the streaming query and use awaitTermination to make your code wait while the streaming data arrives.

streaming_query = readDF.writeStream.format("console").start()
streaming_query.awaitTermination()

When you run this, you will see your data appear, and update as new records come in from the stream.
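
If you don't want the script to block indefinitely, awaitTermination also accepts a timeout in seconds and returns whether the query terminated within that window (30 here is an arbitrary choice):

# Wait up to 30 seconds; returns False if the query is still running
finished = streaming_query.awaitTermination(30)
if not finished:
    streaming_query.stop()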

Your code doesn't have an aggregation, but if you wanted one, it could look like this instead (note that a streaming aggregation needs the "complete" or "update" output mode):

# Cast the Kafka value to a string, then count occurrences of each value
df = readDF.select(readDF.value.cast("string")).groupBy("value").count()

# Streaming aggregations require the "complete" (or "update") output mode
streaming_query = df.writeStream.outputMode("complete").format("console").start()

streaming_query.awaitTermination()
JGC