I have a pipeline that reads a stream of events from PubSub, applies a 1h window, and then writes them to a file on Google Cloud Storage. Recently I realised that sometimes far too many events arrive within a single 1h window, so I also added a trigger that fires once at least 100k events are sitting in the pane. The problem is that the 100k limit is only triggered when a single group inside the window reaches that number, not when the pipeline as a whole does.

The relevant part of the pipeline looks like this:

PCollection<String> rawEvents = pipeline
   .apply("Read PubSub Events",
       PubsubIO.readStrings()
               .fromSubscription(options.getInputSubscription()));

rawEvents
   .apply("1h Window",
       Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
          .triggering(
              Repeatedly
                 .forever(
                    AfterFirst.of(
                       AfterPane.elementCountAtLeast(100000),
                       AfterWatermark.pastEndOfWindow())))
          .discardingFiredPanes()
          .withAllowedLateness(Duration.standardDays(7),
              Window.ClosingBehavior.FIRE_IF_NON_EMPTY)
          .withOnTimeBehavior(Window.OnTimeBehavior.FIRE_IF_NON_EMPTY))
   .apply("Write File(s)", new WriteFiles(options, new EventPartitioner()));

The WriteFiles component is a PTransform that expands to FileIO.Write and groups the elements by a key.

How can I make the window trigger once a total of 100k events are sitting in the pipeline, rather than 100k events for a specific group? Thanks in advance!

Anton
rgngl
  • Triggers are only defined for windows, as a mechanism to signal when it is OK to emit the results accumulated in the window so far. They don't make much sense outside the context of windows. So in your case you're saying that, for each collection of elements within the 1h window, it's OK to emit that collection once it holds more than 100k elements. – Anton May 21 '19 at 16:33
  • The pipeline usually has a `GlobalWindow` assigned to all elements by default (e.g. in `PubsubIO`). You can try setting up the desired behavior in terms of `GlobalWindow` plus the desired triggers and then re-window after that; however, at that point the logic can become more complicated. – Anton May 21 '19 at 16:35
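
To make the difference described in the question concrete, here is a minimal plain-Java sketch. It does not use Beam at all and is not how Beam implements triggers; it only simulates the two counting strategies under discussion. The class name `PaneCountSketch`, both method names, and the toy threshold of 5 are hypothetical and chosen purely for illustration. One counter is kept per key (the behavior observed in the question, where the limit applies to each group), the other is shared by all events (the behavior being asked for).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PaneCountSketch {

    // Simulates a count threshold tracked separately for every key:
    // a firing happens only when one key alone reaches the threshold.
    public static int firingsPerKey(List<String> keys, int threshold) {
        Map<String, Integer> buffered = new HashMap<>();
        int firings = 0;
        for (String key : keys) {
            int count = buffered.merge(key, 1, Integer::sum);
            if (count >= threshold) {
                firings++;
                buffered.put(key, 0); // mimic discarding the fired pane
            }
        }
        return firings;
    }

    // Simulates a single count threshold shared by all events,
    // regardless of which key they belong to.
    public static int firingsTotal(List<String> keys, int threshold) {
        int buffered = 0;
        int firings = 0;
        for (int i = 0; i < keys.size(); i++) {
            buffered++;
            if (buffered >= threshold) {
                firings++;
                buffered = 0; // mimic discarding the fired pane
            }
        }
        return firings;
    }

    public static void main(String[] args) {
        // 12 events spread evenly over 4 keys: 3 events per key.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            keys.add("key-" + (i % 4));
        }
        System.out.println(firingsPerKey(keys, 5)); // prints 0
        System.out.println(firingsTotal(keys, 5));  // prints 2
    }
}
```

With 12 events spread over 4 keys and a threshold of 5, no single key ever reaches the threshold, so the per-key count never fires, while the shared count fires twice. This mirrors the situation in the question, where the total number of buffered events can be far above 100k without any one group reaching it.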