I am currently using Spark 2.2.0 Structured Streaming.

Given a stream of timestamped data with watermarking, is there a way to combine (1) groupBy, to window by the timestamp field together with other grouping criteria, and (2) groupByKey, so that mapGroupsWithState can be applied to the resulting groups for custom sessionization?

Or do I have to settle for somehow embedding the windowing and other grouping logic into groupByKey?

For context:

  • Calling groupBy, which supports windowing, on a Dataset returns a RelationalGroupedDataset, which does not have mapGroupsWithState.

  • Calling groupByKey, which supports mapGroupsWithState, returns a KeyValueGroupedDataset, but that has no support for windowing.
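
To make the fallback concrete, folding the window into the grouping key could look roughly like the sketch below. It is plain Python rather than Spark code and only illustrates the key computation: each event is assigned to a fixed-size tumbling window aligned to the Unix epoch, and the pair (user id, window start) becomes the key that would be passed to groupByKey. The event layout `(user_id, timestamp, payload)` and the names `window_start`, `composite_key` and `group_events` are hypothetical, and watermarking and late data are not handled here.

```python
from datetime import datetime, timedelta, timezone

# Tumbling windows are aligned to the Unix epoch (an assumption of this sketch).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, width):
    """Return the start of the tumbling window of size `width` containing `ts`."""
    offset = (ts - EPOCH) % width  # time elapsed since the window opened
    return ts - offset


def composite_key(event, width):
    """Build the grouping key: the ordinary grouping field plus the window start."""
    user_id, ts, _payload = event  # hypothetical event layout
    return (user_id, window_start(ts, width))


def group_events(events, width):
    """Group events by composite key, preserving arrival order within each group."""
    groups = {}
    for event in events:
        groups.setdefault(composite_key(event, width), []).append(event)
    return groups
```

In Spark the same idea would mean computing this key inside the function given to groupByKey, so that each (key, window) pair gets its own state in mapGroupsWithState. The obvious cost is that the windowing logic has to be reimplemented by hand instead of coming from groupBy.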

Edit:

The issue is now tracked by SPARK-21641 - Combining windowing (groupBy) and mapGroupsWithState (groupByKey) in Spark Structured Streaming.
