I am using Apache Beam to write some streaming pipelines. One requirement for my use case is that I want to trigger every X minutes relative to the window start or end time. How can I achieve this? The trigger I currently use, AfterProcessingTime.pastFirstElementInPane(), is relative to the processing time of the first element in that window.

For example, I created fixed 1-minute windows, so I have window_1 (0-1 min interval), window_2 (1-2 min interval), and so on. Now I want the results for each window to be triggered exactly once, 10 minutes after the beginning of the window, i.e. window_1 at the 10th minute (0 + 10) and window_2 at the 11th minute (1 + 10). [Note: I configure the fixed windows with an allowed lateness of more than 10 minutes, so elements are not discarded if they are delayed.]
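The timing I am after can be written out as plain arithmetic. This is only a minimal sketch that does not use Beam at all; the helper names `window_start` and `desired_fire_time` are made up for illustration, times are in seconds, and it assumes event time and processing time are measured from the same origin.

```python
WINDOW_SIZE = 60      # 1-minute fixed windows, in seconds
FIRE_DELAY = 10 * 60  # fire 10 minutes after the window start


def window_start(event_time, size=WINDOW_SIZE):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % size)


def desired_fire_time(event_time, size=WINDOW_SIZE, delay=FIRE_DELAY):
    """Processing time at which the element's window should fire."""
    return window_start(event_time, size) + delay


# An element at 30 s falls in window_1, which covers [0, 60).
# An element at 90 s falls in window_2, which covers [60, 120).
for event_time in (30, 90):
    print(window_start(event_time), desired_fire_time(event_time))
```

The firing time depends only on the window the element belongs to, not on when the element actually arrives.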

Is there a way to achieve this kind of triggering for a fixed window?

I cannot just assign all elements to a global window and then use a repeated trigger every minute, because that loses all of the elements' window timing information. For example, suppose there are 2 elements in my PCollection that belong to window_1 and window_2 based on their event timestamps, but they were delayed by 3 and 3.2 minutes. Assigning them to the global window will generate some output at the end of the 4th minute that takes both elements into account, whereas in reality I want them to be assigned to their actual fixed windows (as late data).

I want the elements to be assigned to window_1 and window_2 based on their event timestamps. Then window_1 should trigger at the 10th minute, outputting a result that processes only the 1 late element for that window, and window_2 should trigger at the 11th minute, outputting a result that processes only the element that arrived 3.2 minutes late. What trigger setting should I use to get this behavior in my streaming pipeline?
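To make the difference concrete, here is a rough plain-Python simulation (no Beam involved) of the two late elements. The event and arrival times are illustrative values I picked, and `plan_firings` is a hypothetical helper. It compares the firing time I want (relative to the window start) with a firing time measured from the first element's processing time, which is how the current trigger behaves.

```python
from collections import defaultdict

WINDOW_SIZE = 60      # 1-minute fixed windows, in seconds
FIRE_DELAY = 10 * 60  # desired firing offset from the window start

# (event_time, arrival_time) in seconds: one element per window,
# delayed by 3 and 3.2 minutes respectively (illustrative values).
elements = [(30, 30 + 180), (90, 90 + 192)]


def plan_firings(elements, size=WINDOW_SIZE, delay=FIRE_DELAY):
    """Group elements by fixed window and compute both firing times."""
    first_arrival = {}
    counts = defaultdict(int)
    for event_time, arrival_time in elements:
        start = event_time - (event_time % size)
        counts[start] += 1
        earliest = first_arrival.get(start, arrival_time)
        first_arrival[start] = min(earliest, arrival_time)
    return {
        start: {
            "elements": counts[start],
            # What I want: a fixed offset from the window start.
            "window_relative": start + delay,
            # What a first-element-relative delay would give instead.
            "first_element_relative": first_arrival[start] + delay,
        }
        for start in sorted(counts)
    }


for start, plan in plan_firings(elements).items():
    print(start, plan)
```

Each window holds exactly one late element, and the window-relative firing times are the 10th and 11th minute regardless of how late the elements arrived.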

user179156
  • To make sure I understand correctly: are you asking for sliding windows, i.e. windows that overlap, for example at seconds [0, 10], [1, 11], [2, 12]? If so, consider looking at sliding windows: https://cloud.google.com/dataflow/model/windowing#sliding-time-windows Or are you asking to get data for 10-minute windows and then handle late data well, i.e. keep emitting whenever you get late data? – Alex Amato May 11 '18 at 20:48
  • Regarding allowed lateness: if you just want to make sure the data eventually gets processed when it arrives, you can set allowedLateness to a certain duration. This retains your elements so that they can be reprocessed as a group if late data arrives. But make sure you set accumulatingFiredPanes. https://cloud.google.com/dataflow/model/triggers#window-accumulation-modes https://cloud.google.com/dataflow/model/windowing#managing-time-skew-and-late-data – Alex Amato May 11 '18 at 20:57
  • @AlexAmato: I don't mean either of those, neither sliding windows nor a 10-minute window. To simplify the problem, forget about the 10 minutes. Let's say I want to create 1-minute fixed windows and have the trigger fire when the window ends. By window end I don't mean when it is considered ended based on watermark progress; I mean the exact processing time at which it ends. So let's say that, starting from the epoch, I have 1-minute windows and the current window is the Nth window. I want the data for this window to be emitted/triggered at the beginning of the (N+1)th fixed window (the end of the Nth window). – user179156 May 12 '18 at 05:08
  • @AlexAmato: Does that make sense? The 10 minutes was a generalization of that. I want the Nth window's data to be triggered at N + X time, i.e. X minutes after the beginning of the Nth window. The window size can be anything from 1 second to 1 hour or more, it doesn't matter; I just want a trigger that uses the window's beginning time as its reference. I am only talking about processing time here, with no event time or watermark assumption about when the window ends. Everything is in an absolute processing-time reference. – user179156 May 12 '18 at 05:12

1 Answer

I believe the following code works for you:

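import apache_beam as beam
from apache_beam.transforms import trigger, window

pcollection | beam.WindowInto(
    window.FixedWindows(1 * 60),  # 1-minute fixed windows
    trigger=trigger.AfterProcessingTime(9 * 60),  # 9-minute processing-time delay
    accumulation_mode=trigger.AccumulationMode.DISCARDING,  # assumed; a mode is required with a trigger
    allowed_lateness=lateness_seconds)  # value not given here; the question uses more than 10 minutes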

The window size is 1 minute, and the trigger fires the data after a 9-minute processing-time delay. However, in many cases it is much faster to use sliding windows and then take care of the elements that get processed more than once. As Alex Amato mentioned, watermarks and AfterWatermark event-time triggers should also work here.
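To illustrate the duplication that sliding windows introduce, here is a small plain-Python sketch (it does not use Beam). The 10-minute size and 1-minute period are assumptions chosen to match the overlapping pattern from the comments, and `sliding_windows` is a hypothetical helper, not a Beam function.

```python
def sliding_windows(event_time, size, period):
    """Return (start, end) of every sliding window containing event_time.

    Windows start at multiples of `period` and cover [start, start + size).
    """
    # Latest window start that is not after the element.
    start = event_time - (event_time % period)
    windows = []
    # Walk backwards while the window still covers the element.
    while start > event_time - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# One element at 90 s, with 10-minute windows that start every minute.
windows = sliding_windows(90, size=600, period=60)
print(len(windows))
print(windows[0], windows[-1])
```

With these assumed settings a single element lands in size / period = 10 overlapping windows, so any downstream aggregation sees it that many times unless the duplicates are handled.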

Shahin Vakilinia