
I'm building a data pipeline using Spark Structured Streaming, which reads data from Kafka.

Here is my source code:

from pyspark.sql import functions as f

queries = []

plug_df = event_df.withWatermark('timestamp', '10 minutes').groupby(
    f.window(f.col('timestamp'), '5 minutes', '5 minutes'),
    f.col('house_id'),
    f.col('household_id'),
    f.col('plug_id')
).agg(
    f.avg('value').alias('avg_load')
)

house_df = plug_df.groupby(
    f.col('house_id'),
    f.col('window')
).agg(
    f.sum('avg_load').alias('avg_load')
)

queries.append(plug_df.writeStream.format('console').outputMode('update').start())
queries.append(house_df.writeStream.format('console').outputMode('update').start())

for query in queries:
    query.awaitTermination()

spark.stop()

The plug_df query works perfectly fine, but when I start the house_df query, it raises the following exception:

pyspark.errors.exceptions.captured.AnalysisException: Detected pattern of possible 'correctness' issue due to global watermark. The query contains a stateful operation that can emit rows older than the current watermark plus the allowed late record delay, which are considered as "late rows" in downstream stateful operations and these rows can be discarded. Please refer to the programming guide documentation for more details. If you understand the potential risk of correctness issues and still need to run the query, you can disable this check by setting the configuration `spark.sql.streaming.statefulOperator.checkCorrectness.enabled` to false.

So my question is: how can I perform multiple (chained) aggregations in Spark Structured Streaming? What is the recommended approach?

Anh Duc Ng

1 Answer


This is indeed not supported. My case is not exactly the same as yours, since I don't use watermarks, but I also had several aggregations to run. What I did was use the foreachBatch sink where you use the console sink.

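As a skeleton:

def func(batch_df, batch_id):
    # batch_df is a regular (non-streaming) DataFrame holding one micro-batch,
    # so aggregate as much as you want here
    ...

# df is your streaming DataFrame
df.writeStream.foreachBatch(func).start()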

Hope it works for you

Medzila