I am writing a Spark Structured Streaming program. I have experience with Spark batch processing and Spark Streaming, but in the case of Structured Streaming I have run into several differences.

To reproduce my issue, I provide a code snippet below. The code consumes a data.json file stored in the data folder:

[
  {"id": 77,"type": "person","timestamp": 1532609003},
  {"id": 77,"type": "person","timestamp": 1532609005},
  {"id": 78,"type": "crane","timestamp": 1532609005}
]

Code:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

spark = SparkSession \
    .builder \
    .appName("Test") \
    .master("local[2]") \
    .getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("type", StringType()),
    StructField("timestamp", LongType())
])

ds = spark \
    .readStream \
    .format("json") \
    .schema(schema) \
    .load("data/")

# These two lines are where the program fails: collect() is not
# allowed on a streaming DataFrame
times = ds.select("timestamp").rdd.flatMap(lambda x: x).collect()
ids = ds.select("id").rdd.flatMap(lambda x: x).collect()
# do other operations with "times" and "ids"

df_persons = ds \
    .filter(func.col("type") == "person") \
    .drop("type")

query = df_persons \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()

For each mini-batch, I must retrieve times and ids in order to apply a global operation on them, as sketched below. But this code fails because I apply collect() on the streaming DataFrame ds:

pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nFileSource[data/]'
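
To make the intent concrete, here is a minimal sketch of the kind of per-batch access I am after (assuming Spark 2.4+, where foreachBatch exposes each micro-batch as a static DataFrame; process_batch is just an illustrative name, and ds is the streaming DataFrame from the code above):

def process_batch(batch_df, batch_id):
    # Inside foreachBatch the micro-batch is a static DataFrame,
    # so collect() is allowed here
    times = [row["timestamp"] for row in batch_df.select("timestamp").collect()]
    ids = [row["id"] for row in batch_df.select("id").collect()]
    # ... apply the global operation on times and ids ...

query = ds \
    .writeStream \
    .foreachBatch(process_batch) \
    .start()

query.awaitTermination()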

I tried adding writeStream.start().awaitTermination() to the times query, but that didn't solve the problem.
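
Roughly, that attempt looked like this (a reconstructed sketch; the console sink and the variable names are my assumptions):

# Reconstructed sketch of the attempt
times_stream = ds.select("timestamp")

query_times = times_stream \
    .writeStream \
    .format("console") \
    .start()

# Starting the query does not help: times_stream is still a streaming
# DataFrame, so collect() on it raises the same AnalysisException
query_times.awaitTermination()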

  • @user6910411: The recommended answer does not solve my problem at all. I need to convert a column of the Dataset to a NumPy array using collect(). The marked duplicate explains how to use a UDF, which seems to be inappropriate in my case. – Mozimaki Nov 23 '18 at 16:07
  • @Mozimaki While it is unfortunate, the answer correctly describes what can be done with the Structured Streaming API, and what you're trying to do (conversion to RDD and collect) is simply not supported. If you want to solve a specific problem, I would recommend asking a separate question, describing what exactly you are trying to achieve. I've already seen [your other question](https://stackoverflow.com/q/53450514/6910411) which once again asks for an [unsupported operation](https://stackoverflow.com/questions/46036845/how-to-apply-lag-function-on-streaming-dataframe#comment79036675_46036845). – zero323 Nov 23 '18 at 17:29
