I have a pipeline that reads a stream of events from PubSub, applies a 1h window, and writes them to a file on Google Cloud Storage. Recently I realised that sometimes far too many events arrive within a single 1h window, so I also added a trigger that fires once more than 100k events are sitting in the pane. The problem is that the 100k limit applies per key: it fires when a single group inside the window exceeds the count, not when the pipeline as a whole does.
The relevant part of the pipeline looks like this:
PCollection<String> rawEvents = pipeline
    .apply("Read PubSub Events",
        PubsubIO.readStrings()
            .fromSubscription(options.getInputSubscription()));

rawEvents
    .apply("1h Window",
        Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(
                Repeatedly.forever(
                    AfterFirst.of(
                        AfterPane.elementCountAtLeast(100000),
                        AfterWatermark.pastEndOfWindow())))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.standardDays(7),
                Window.ClosingBehavior.FIRE_IF_NON_EMPTY)
            .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_IF_NON_EMPTY))
    .apply("Write File(s)", new WriteFiles(options, new EventPartitioner()));
The WriteFiles component is a custom PTransform that expands to FileIO.Write and groups the elements by a key.
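For reference, its shape is roughly this. It's a trimmed sketch rather than the real code: MyPipelineOptions, getOutputPath() and EventPartitioner.partitionFor() are stand-ins for my actual types, but the .by(...) clause is where the per-key grouping I mentioned happens.

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.values.PCollection;

class WriteFiles extends PTransform<PCollection<String>, WriteFilesResult<String>> {

    private final MyPipelineOptions options;    // stand-in for my options interface
    private final EventPartitioner partitioner; // maps an event to its partition key

    WriteFiles(MyPipelineOptions options, EventPartitioner partitioner) {
        this.options = options;
        this.partitioner = partitioner;
    }

    @Override
    public WriteFilesResult<String> expand(PCollection<String> events) {
        // Local copy so the lambda below doesn't capture `this`.
        EventPartitioner p = partitioner;
        return events.apply(
            FileIO.<String, String>writeDynamic()
                // FileIO groups elements by destination, so the element-count
                // trigger is evaluated per partition key, not per window.
                .by(event -> p.partitionFor(event))
                .via(TextIO.sink())
                .to(options.getOutputPath())
                .withDestinationCoder(StringUtf8Coder.of())
                .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt"))
                .withNumShards(1));
    }
}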
How would I make the trigger fire once there are 100k events in total sitting in the window, rather than 100k events for a specific key?
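To illustrate what I mean by a global count: if every event were forced onto one constant key, the element-count condition would be evaluated against the whole stream, because the trigger fires per key and window at the GroupByKey. A rough, untested sketch, where windowedEvents stands for the output of the Window transform above:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// All events share key 0, so elementCountAtLeast(100000) now counts
// every event in the window instead of the events of one group.
PCollection<KV<Integer, Iterable<String>>> globallyTriggered =
    windowedEvents
        .apply("Constant Key", WithKeys.<Integer, String>of(0))
        .apply("Group Globally", GroupByKey.<Integer, String>create());

But that funnels the whole stream through a single key and kills the parallelism, so I'm hoping there's a better way. Thanks in advance!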