I have a Google Dataflow job that reads data from Pub/Sub, aggregates it, and finally writes it to InfluxDB. I want to aggregate the data in 1-minute windows and have only one entry in the DB for each minute. The problem is that I also want to allow late data, so I need to keep accumulating for 5 minutes and then send a single entry to the DB.

Is it possible? I tried to do that with the below code, but I don't get what I want:

input.apply(Window
  .<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
  .triggering(
      AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(5)))
  .withAllowedLateness(Duration.standardMinutes(5))
  .discardingFiredPanes());
Anton
bsmarcosj

1 Answer

I already contributed to a similar question. You can use .triggering(Never.ever()) to suppress the on-time panes. Then, as you are already doing, set the allowed lateness to 5 minutes so that late records are still accepted.

It's also important to change the Window.ClosingBehavior to FIRE_ALWAYS. This covers the case where no late data arrives and the on-time records have not been emitted yet. Once the window closes, it always emits a final pane with PaneInfo.isLast set to true.

So, for your case, the code would be something like:

input.apply(Window
  .<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
  .triggering(Never.ever())
  .withAllowedLateness(Duration.standardMinutes(5), Window.ClosingBehavior.FIRE_ALWAYS)
  .discardingFiredPanes());
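
To make the timing concrete, here is a minimal plain-Java sketch of when that single pane would be produced for a given element. It does not use the Beam API: the class and method names (WindowTimingSketch, windowStart, finalPaneAt, isAccepted) are hypothetical, and it ignores Beam's millisecond-level boundary handling. It only assumes 1-minute fixed windows aligned to the epoch and 5 minutes of allowed lateness, as in the snippet above.

```java
import java.time.Duration;
import java.time.Instant;

public class WindowTimingSketch {
    // Assumed to match the pipeline above: 1-minute windows, 5 minutes of lateness.
    static final Duration WINDOW_SIZE = Duration.ofMinutes(1);
    static final Duration ALLOWED_LATENESS = Duration.ofMinutes(5);

    // Start of the fixed window (aligned to the epoch) that contains eventTime.
    static Instant windowStart(Instant eventTime) {
        long sizeMs = WINDOW_SIZE.toMillis();
        long ts = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // Exclusive end of that window.
    static Instant windowEnd(Instant eventTime) {
        return windowStart(eventTime).plus(WINDOW_SIZE);
    }

    // Approximate watermark position at which the window expires and,
    // with Never.ever() plus FIRE_ALWAYS, the one final pane is emitted.
    static Instant finalPaneAt(Instant eventTime) {
        return windowEnd(eventTime).plus(ALLOWED_LATENESS);
    }

    // An element still counts toward its window while the watermark
    // has not yet reached the expiry point.
    static boolean isAccepted(Instant eventTime, Instant watermark) {
        return watermark.isBefore(finalPaneAt(eventTime));
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2020-01-01T10:00:42Z");
        System.out.println("window start: " + windowStart(event));
        System.out.println("window end:   " + windowEnd(event));
        System.out.println("final pane:   " + finalPaneAt(event));
        System.out.println("accepted at watermark 10:05:59? "
            + isAccepted(event, Instant.parse("2020-01-01T10:05:59Z")));
    }
}
```

In this sketch, an element stamped 10:00:42 belongs to the window [10:00:00, 10:01:00), and its only output appears once the watermark reaches roughly 10:06:00, so each minute yields one DB entry about five minutes after the window ends.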
Guillem Xercavins