
Say I want to calculate the average of a certain metric over the last 10 minutes, re-evaluated after every minute, and compare it to the average of the same metric over the last 20 minutes, also re-evaluated after every minute. What I need is 2 rolling windows (of duration 10 min and 20 min) that each slide forward by one minute, every minute. The standard options do not fit: sliding windows would maintain 10 and 20 overlapping windows respectively, and fixed-duration windows with early firing do not roll forward. Alternatively, if I could discard all but the latest of the sliding windows, my problem would be solved; otherwise, maintaining that many overlapping sliding windows is very costly.

Could you please help here? A custom WindowFn would be very helpful.
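To make the desired semantics concrete, here is a plain-Python sketch (not Beam code; `rolling_averages` is a hypothetical helper) of the two values the rolling windows should produce each minute:

```python
def rolling_averages(events, now, short_mins=10, long_mins=20):
    """Compute the two rolling averages described above: the mean of a
    metric over the last `short_mins` and `long_mins` minutes, evaluated
    at time `now`.  `events` is an iterable of (timestamp_minutes, value)
    pairs; timestamps are event times in minutes."""
    def avg(xs):
        return sum(xs) / len(xs) if xs else None

    short_vals = [v for t, v in events if now - short_mins < t <= now]
    long_vals = [v for t, v in events if now - long_mins < t <= now]
    return avg(short_vals), avg(long_vals)


# Evaluated once per minute, these two numbers are exactly what the two
# rolling windows should emit.
events = [(t, float(t)) for t in range(1, 21)]   # value == timestamp
short_avg, long_avg = rolling_averages(events, now=20)
# short window covers minutes 11..20 -> 15.5; long window covers 1..20 -> 10.5
```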

Nishant
  • I had the same question but didn't figure out a way to implement a rolling window. If I read the Beam code right, windowing does not discard anything. – Rui Wang Feb 20 '19 at 18:20
  • If I have understood the request correctly, would the following approach work? Branch your source PCollection (call .apply twice). In one branch, use a sliding window of 10 min with a 1-min period; in the 2nd branch, use a sliding window of 20 min with a 1-min period. Do the average on both branches, re-window both branches into a FixedWindow of 1 min, then Flatten the branches back together again and do your comparison. – Reza Rokni Feb 22 '19 at 08:06
  • @RezaRokni: Thanks for your answer. This is exactly how Beam would currently recommend doing it. However, the problem is that this creates 20 and 10 overlapping sliding windows respectively, which is a lot of data to maintain; since I am not worried about late data, I could discard all but the first sliding window. So the question was to find out whether all the additional windows can be done away with – Nishant Feb 22 '19 at 16:01
  • @RuiWang Could you please review the answer below and opine? – Nishant Feb 22 '19 at 16:40
  • Have you experimented with doing (SlidingWindow.10Mins.Period1Min).apply(Avg).apply(create a result datapoint that has the timestamp and avg as properties).apply(SlidingWindow.30Min.Period1Min).(GBK)? The end of that pipe will have an Iterable which contains a maximum of 30 values. If you have millions of keys, that would still be a lot of data points of course. – Reza Rokni Feb 23 '19 at 03:57
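The cost concern raised in the comments can be made concrete: with a sliding window of duration 20 min and period 1 min, every element is assigned to 20 overlapping windows. A small plain-Python model of that assignment (a hypothetical helper mirroring sliding-window semantics, not Beam API):

```python
def sliding_window_starts(ts, size, period):
    """Start times of every sliding window [start, start + size) that
    contains an element at timestamp `ts`, where starts fall on
    multiples of `period`.  All values are in the same time unit
    (e.g. minutes)."""
    last_start = (ts // period) * period   # latest window containing ts
    starts = []
    start = last_start
    # A window [start, start + size) contains ts iff ts - size < start <= ts.
    while start > ts - size:
        starts.append(start)
        start -= period
    return sorted(starts)


# An element at minute 100 lands in 20 windows for a 20-min/1-min
# sliding window, and in 10 windows for the 10-min/1-min one.
windows_20 = sliding_window_starts(100, size=20, period=1)   # 20 starts: 81..100
windows_10 = sliding_window_starts(100, size=10, period=1)   # 10 starts: 91..100
```

This is why keeping only the most recent window, as the question asks, would cut the state held per element by a factor of `size / period`.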

1 Answer


I must update with what I ended up doing finally. I created a global window with an allowed lateness of 1 hour, triggering every minute, repeated forever, with accumulating panes. From this global window, I applied a DoFn that filters for elements with timestamps in the last 10 mins (present instant minus 10 mins) and elements in the last 20 mins (present instant minus 20 mins), producing 2 distinct PCollections. I applied this time filtering twice: once to the trigger output of the global window, to add new elements to the 10-min and 20-min PCollections, and then again to each collection itself, to remove elements that are no longer within the time duration. For now, these 2 PCollections are serving as the rolling windows, but I still need to audit the results to confirm that this is indeed working.
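As a plain-Python model of this approach (not Beam code; class and method names are hypothetical), the per-firing logic looks roughly like this:

```python
class RollingWindowState:
    """Models the answer's scheme: a globally windowed stream firing
    every minute with accumulating panes, whose elements are filtered
    by event timestamp into a 10-min view and a 20-min view."""

    def __init__(self, short_mins=10, long_mins=20):
        self.short_mins = short_mins
        self.long_mins = long_mins
        self.elements = []          # (event_time_minutes, value)

    def on_pane(self, now, new_elements):
        """Called once per trigger firing (every minute)."""
        # Filter pass 1: admit newly fired elements that are in range.
        self.elements += [(t, v) for t, v in new_elements
                          if t > now - self.long_mins]
        # Filter pass 2: drop elements that have rolled out of range.
        self.elements = [(t, v) for t, v in self.elements
                         if t > now - self.long_mins]

        def avg(xs):
            return sum(xs) / len(xs) if xs else None

        short_vals = [v for t, v in self.elements
                      if t > now - self.short_mins]
        long_vals = [v for t, v in self.elements]
        return avg(short_vals), avg(long_vals)


# Feed one element per minute for 30 minutes; at minute 30 the short
# window holds minutes 21..30 and the long window holds minutes 11..30.
state = RollingWindowState()
for minute in range(1, 31):
    short_avg, long_avg = state.on_pane(minute, [(minute, float(minute))])
# short_avg == 25.5, long_avg == 20.5
```

Note that, as the comments below point out, this model assumes a single consistent "present instant"; in a distributed runner the wall-clock time used for filtering may differ across machines.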

Nishant
  • I see your approach now. The drawback is that there is no guarantee on the completeness of your data. My understanding of your original question was that you wanted a rolling window with all the benefits of the Beam model's event-timestamp semantics. – Rui Wang Feb 22 '19 at 20:16
  • Another question here: each time the trigger fires, you will see the accumulated data and filter based on it, so I expect duplicates between firings. Would you consider discarding mode? Another uncertainty for me is time: you compute the present instant minus 10/20 mins, but is the present instant the same across machines? Filtering might happen on multiple machines. – Rui Wang Feb 22 '19 at 20:21
  • Never mind, it depends on your accuracy requirement, but accumulating mode might be just fine for you. You might need to estimate how much data will be accumulated in this mode. – Rui Wang Feb 22 '19 at 20:30