I am writing a Spark Structured Streaming program. I have experience with Spark batch processing and with Spark Streaming (DStreams), but in the case of Structured Streaming I have observed several differences.
To reproduce my issue, here is a code snippet that consumes a data.json file stored in the data folder:
[
{"id": 77,"type": "person","timestamp": 1532609003},
{"id": 77,"type": "person","timestamp": 1532609005},
{"id": 78,"type": "crane","timestamp": 1532609005}
]
Code:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, LongType, StringType, \
    StructField, StructType

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

# These two lines fail with the AnalysisException shown below,
# because collect() is applied to a streaming DataFrame:
times = ds.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds.select("id").rdd.flatMap(lambda x: x).collect()
# do other operations with "times" and "ids"

df_persons = ds \
    .filter(func.col("type") == "person") \
    .drop("type")

query = df_persons \
    .writeStream \
    .format('console') \
    .start()

query.awaitTermination()
For each micro-batch, I need to retrieve times and ids as plain lists in order to apply a global operation on them.
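To make the intent concrete, here is what the same logic looks like as plain batch code, where collect() works; the min/max window at the end is only a made-up placeholder for my actual global operation:

# Batch version of the same logic, for illustration only
batch_df = spark.read.schema(schema).json("data/")

times = batch_df.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = batch_df.select("id").rdd.flatMap(lambda x: x).collect()

# Placeholder for the real global operation on the collected values
time_window = (min(times), max(times))
print(time_window, ids)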
But the streaming code fails because I apply collect() on ds, which is a streaming DataFrame:
pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[data/]'
I tried to add writeStream.start().awaitTermination() to times, but it didn't solve the problem.
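Concretely, that attempt looked roughly like this (a sketch; I used the console sink as in the snippet above, and times_query is just my name for the handle):

# Sketch of the attempt: write the timestamp column to its own sink
times_query = ds \
    .select("timestamp") \
    .writeStream \
    .format("console") \
    .start()
times_query.awaitTermination()

This starts a query that prints the timestamp column, but it still does not give me the values as a list on the driver, so I still cannot apply my global operation. How can I retrieve times and ids for each micro-batch?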