The windowing section of the Beam programming model guide (section 7.1.1) shows a window being defined and then used in a GroupByKey transform after a ParDo.

How long does a window remain in scope for an element?

Let's imagine a pipeline like this:

my_pcollection = p | MySourceOfData()

results_pcoll = (my_pcollection
                 | beam.WindowInto(..., trigger=...)
                 | beam.GroupByKey()
                 | beam.ParDo(DoSomeFormattingFn())
                 | beam.CombineGlobally(sum))

Suppose that the first aggregation (the GroupByKey) is per key, but in the second aggregation you want to combine elements across all keys.

What will results_pcoll look like? Will it be windowed? Will it be per-key?

Pablo

1 Answer

In Beam, it is important to remember that every element has a window associated with it.

In the code snippet, elements in my_pcollection are associated with the Global Window. When you add a beam.WindowInto, you assign a new window to each element, and when they go into the GroupByKey, elements are grouped by both key and window.
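
For example, here is a minimal sketch of that step. FixedWindows, the trigger choice, and the timestamped toy data are assumptions for illustration, not part of the original pipeline:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as p:
    grouped = (
        p
        | beam.Create([('a', 1), ('a', 2), ('b', 3)])
        # Give each element an event timestamp so window assignment has
        # something to work with (the timestamps here are arbitrary).
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1] * 30))
        # Each element is assigned to the 60-second fixed window that
        # contains its timestamp.
        | beam.WindowInto(window.FixedWindows(60),
                          trigger=AfterWatermark(),
                          accumulation_mode=AccumulationMode.DISCARDING)
        # Grouping happens per (key, window): values of 'a' that fall in
        # different windows are grouped separately.
        | beam.GroupByKey())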


When you go downstream to a ParDo and a Combine, your elements continue to carry the same window and the same trigger.

This happens because Beam tries to let your data keep flowing through the pipeline, so it preserves the same window and trigger semantics.
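
As an illustration, a DoFn downstream of the GroupByKey can still read each element's window through beam.DoFn.WindowParam. The formatting below is just a sketch of what DoSomeFormattingFn might do:

import apache_beam as beam

class DoSomeFormattingFn(beam.DoFn):
    # The window assigned by the upstream WindowInto is still attached to
    # the element and can be requested as a parameter.
    def process(self, element, window=beam.DoFn.WindowParam):
        key, values = element
        yield '%s -> %s (in window %s)' % (key, sum(values), window)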


With those considerations, your results_pcoll will have the same windowing and the same trigger semantics that you set at the beginning of the pipeline.

By combining globally, you'll get a single aggregation over all keys, but you will still get one such aggregation per window.
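
A sketch of how that end-to-end pipeline might look, again assuming FixedWindows and toy input for illustration. Note that in the Python SDK, CombineGlobally on a PCollection that is not in the global window needs .without_defaults():

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

with beam.Pipeline() as p:
    per_window_totals = (
        p
        | beam.Create([('a', 1), ('a', 2), ('b', 3)])
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1] * 30))
        | beam.WindowInto(window.FixedWindows(60),
                          trigger=AfterWatermark(),
                          accumulation_mode=AccumulationMode.DISCARDING)
        | beam.GroupByKey()
        # Reduce each (key, window) group to a single number.
        | beam.Map(lambda kv: sum(kv[1]))
        # One "global" sum is produced per window, not one for the whole
        # pipeline; .without_defaults() skips the empty-window default,
        # which only makes sense under the global window.
        | beam.CombineGlobally(sum).without_defaults())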

Pablo