
I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data is available, its records lag 10-15 minutes behind "real time" (for example, look at the `_last_updt` field).

For example, at 00:20 I get data timestamped 00:10; at 00:35, data from 00:20; at 00:50, data from 00:40. So the interval at which I get new data is fixed (every 15 minutes), although the interval between the timestamps varies slightly.

I am trying to consume this data on Dataflow (Apache Beam), and for that I am experimenting with sliding windows. My idea is to collect and work on 4 consecutive data points (4 x 15min = 60min), and ideally update my sum/average calculation as soon as a new data point is available. For that, I've started with this code:

PCollection<TrafficData> trafficData = input        
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
        SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
            .every(Duration.standardMinutes(15)))       // interval to get new data
        .triggering(AfterWatermark
                        .pastEndOfWindow()
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());

Unfortunately, it looks like when I receive a new data point from my input, I do not get a new (updated) result from the GroupByKey that I apply afterwards.

Is this something wrong with my SlidingWindows? Or am I missing something else?
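
For reference, my expectation of which windows an element belongs to can be sanity-checked outside Beam. This plain-Java sketch (the helper is purely illustrative, not Beam API) lists the start times, in minutes, of the 60-minute windows sliding every 15 minutes that should contain a given event time:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowCheck {

    // Start times (minutes, relative to the epoch/midnight) of every sliding
    // window of the given size and period that contains eventMin. Beam assigns
    // an element to each window [start, start + size) where start is a
    // multiple of the period.
    static List<Integer> windowStartsFor(int eventMin, int sizeMin, int periodMin) {
        List<Integer> starts = new ArrayList<>();
        int lastStart = Math.floorDiv(eventMin, periodMin) * periodMin;
        for (int start = lastStart - sizeMin + periodMin; start <= lastStart; start += periodMin) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // element at 00:40, 60-minute windows sliding every 15 minutes
        System.out.println(windowStartsFor(40, 60, 15)); // [-15, 0, 15, 30]
    }
}
```

So an element timestamped 00:40 should land in four overlapping windows (starting at 23:45, 00:00, 00:15 and 00:30), which is exactly what should let the hourly aggregate update every 15 minutes.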

tyron
  • Do you mean you don't get any elements after the first one or you don't get late elements which are added to the window after the first firing? If it's the latter, then it's likely caused by `allowedLateness(Duration.ZERO)`, this will drop all late elements. – Anton Jun 07 '18 at 23:29
  • Hi @Anton, I don't get late elements after first firing, even though the elements should be on the same "window". For example, element arriving at 01:14 that should be included in the window that started at 00:15, but it is not. My understanding of the `allowedLateness` is that setting this to something greater than 0 (let's say, 5min), would allow elements arriving after the projected closure of the window to be included (so if the element from 01:14 arrived just at 01:18, it would still be included on the window closed at 01:15). If my understanding is wrong, please let me know. – tyron Jun 09 '18 at 08:36

2 Answers


One issue may be that the watermark is advancing past the end of the window and dropping all later elements. You may try allowing a few minutes of lateness after the watermark passes:

PCollection<TrafficData> trafficData = input        
    .apply("MapIntoSlidingWindows", Window.<TrafficData>into(
        SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
            .every(Duration.standardMinutes(15)))       // interval to get new data
        .triggering(AfterWatermark
                        .pastEndOfWindow()
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
                        .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
        .withAllowedLateness(Duration.standardMinutes(15))
        .accumulatingFiredPanes());

Let me know if this helps at all.

Pablo
  • I expected this "few minutes after the watermark" to be on the withAllowedLateness that I already included. If that's not the case, would you be able to explain what's the difference between those 2? Thanks! – tyron Jun 09 '18 at 07:59
  • In your code, `withAllowedLateness` received `Duration.ZERO`, meaning that any late element will be ignored by your pipeline. You can pass a duration of more than zero to let your pipeline wait longer for more elements in that window. LMK if that helps. – Pablo Jun 10 '18 at 00:33
  • Ok, dumb question: if you look at the [example](https://data.cityofchicago.org/resource/n4j6-wkkf.json?segmentid=1), you'll see a `_last_updt` field. I import the data using `withTimestampAttribute` on this field. So if a datapoint with time "00:15" reaches my system at "00:30" to my system, do I need this `withAllowedLateness` to address that? I.e. given the delay that I have on my data, should I consider all my data as being late everytime? – tyron Jun 11 '18 at 05:25
  • It's not a stupid question, but could you elaborate on how you did the `withTimestampAttribute`? In Apache beam, given a source is unbounded, the default behavior is to provide a timestamp for each element based on the current time it is received. If you are using something like `c.outputWithTimestamp(c.element(), last_updt)` then beam divides the elements into windows based on the associated ***event time*** for each element. So it would not look at your system time – Haris Nadeem Jun 11 '18 at 14:41
  • ** If you are using something like (where you are manually updating the timestamp) `c.outputWithTimestamp(c.element(), last_updt)` – Haris Nadeem Jun 11 '18 at 14:47
  • My source is unbounded, yeah. So, I have custom script that every 1 minute read from the API source, parses the information (extracting the `_last_updt`) and sends the information to a PubSub queue with a custom attribute ("timestamp_ms"). Then on Apache Beam I read data using `PubSubIO.Read.fromTopic("myTopic").withTimestampAttribute("timestamp_ms")`. Does it make sense? Do you believe I should then consider all my data as late always (since on "processing time" the data will always be at least 15 minutes behind)? – tyron Jun 11 '18 at 17:17
  • No you should not consider your data arriving late since your data will always be windowed based on your "timestamp_ms" and not the system time. Hope that makes sense? But If you want, you can always try it out on a small sample (of maybe 100 lines) and create fake time stamps that are distant (but stream them in at the same time) and see what happens. The results will be based on the fake time stamps – Haris Nadeem Jun 11 '18 at 19:27
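
The publisher-side conversion described in the comments (parsing `_last_updt` and attaching it as a `timestamp_ms` attribute) can be sketched as follows. The `_last_updt` pattern and the UTC zone here are assumptions to be checked against the actual dataset; PubsubIO's timestamp attribute accepts either epoch milliseconds as a string or an RFC 3339 timestamp:

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimestampAttr {

    // Assumed _last_updt format (e.g. "2018-06-11 17:00:24.0"); verify
    // against the real dataset before relying on it.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S");

    // Produces the string value for the "timestamp_ms" Pub/Sub attribute:
    // epoch milliseconds, which withTimestampAttribute can interpret.
    static String toTimestampMs(String lastUpdt, ZoneId zone) {
        LocalDateTime dt = LocalDateTime.parse(lastUpdt, FMT);
        return String.valueOf(dt.atZone(zone).toInstant().toEpochMilli());
    }

    public static void main(String[] args) {
        System.out.println(toTimestampMs("2018-06-11 17:00:24.0", ZoneId.of("UTC")));
    }
}
```

With the attribute set this way, Beam windows elements by event time (`_last_updt`), not by arrival time, so the constant 10-15 minute publishing delay does not by itself make the data "late".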

So @Pablo (from my understanding) gave the correct answer, but I had some suggestions that would not fit in a comment.

I wanted to ask whether you really need sliding windows. From what I can tell, fixed windows would do the job for you and be computationally simpler as well. Since you are using accumulating fired panes, you don't need a sliding window: your next DoFn will already be computing an average from the accumulated panes.

As for the code, I made changes to the early and late firing logic. I also suggest increasing the window size. Since you know the data comes every 15 minutes, you should close the window *after* 15 minutes rather than at exactly 15 minutes. But you also don't want to pick a size that eventually collides with a multiple of 15 (like 20, which collides at 60 minutes), so pick a number that is co-prime with 15, for example 19. Also allow for late entries.

    PCollection<TrafficData> trafficData = input
        .apply("MapIntoFixedWindows", Window.<TrafficData>into(
                FixedWindows.of(Duration.standardMinutes(19)))
            .triggering(AfterWatermark.pastEndOfWindow()
                // fire the moment you see an element
                .withEarlyFirings(AfterPane.elementCountAtLeast(1))
                // this line is optional since you already have a past-end-of-window
                // trigger and an early firing, but just in case
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
            .withAllowedLateness(Duration.standardMinutes(60))
            .accumulatingFiredPanes());

Let me know if that solves your issue!

EDIT

So, I could not understand how you computed the above example, so I am using a generic one. Below is a generic averaging function:

public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {
  public static class Accum {
    int sum = 0;
    int count = 0;
  }

  @Override
  public Accum createAccumulator() { return new Accum(); }

  @Override
  public Accum addInput(Accum accum, Integer input) {
      accum.sum += input;
      accum.count++;
      return accum;
  }

  @Override
  public Accum mergeAccumulators(Iterable<Accum> accums) {
    Accum merged = createAccumulator();
    for (Accum accum : accums) {
      merged.sum += accum.sum;
      merged.count += accum.count;
    }
    return merged;
  }

  @Override
  public Double extractOutput(Accum accum) {
    return ((double) accum.sum) / accum.count;
  }
}

In order to run it you would add the line:

PCollection<Double> average = trafficData.apply(Combine.globally(new AverageFn()));

Since you are currently using accumulating firing triggers, this is the simplest way to code a solution.

HOWEVER, if you want to use discarding fired panes, you would need a PCollectionView to store the previous average and pass it as a side input to the next computation in order to keep track of the values. This is a bit more complex to code, but it would definitely improve performance, since only a constant amount of work is done per window, unlike with accumulating firing.
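
To make that concrete, the bookkeeping the side input would carry can be illustrated outside Beam (this is only the merge logic, not the `PCollectionView` wiring itself): with discarding panes, each firing contributes just its own partial sum and count, which you merge into the previous state:

```java
public class RunningAverage {

    // State you would keep in the side input: totals seen so far.
    static final class State {
        final long sum;
        final long count;
        State(long sum, long count) { this.sum = sum; this.count = count; }
        double average() { return count == 0 ? 0.0 : (double) sum / count; }
    }

    // With discarding panes, each firing only carries the new elements,
    // so we fold the pane's partial sum/count into the previous state.
    static State mergePane(State previous, long paneSum, long paneCount) {
        return new State(previous.sum + paneSum, previous.count + paneCount);
    }

    public static void main(String[] args) {
        State s = new State(0, 0);
        s = mergePane(s, 10, 1);   // first pane: one element, value 10
        s = mergePane(s, 30, 2);   // second pane: two elements summing to 30
        System.out.println(s.average()); // 40 / 3 ≈ 13.33
    }
}
```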

Does this make enough sense for you to generate your own function for discarding fire pane window?

Haris Nadeem
  • HI @Haris, thanks for explanation. If I use fixed windows of size 19, each window would only have 1 element, no? I didn't understand how that would give me the "moving average on the last hour" for my data. Tbh I'm wondering if I should use `discardingFiredPanes` instead of accumulating for that logic... – tyron Jun 11 '18 at 05:30
  • Using `discardingFiredPanes` would be computationally less expensive and a good long-term decision, but it would require you to restructure your logic for the moving average. – Haris Nadeem Jun 11 '18 at 14:42
  • Could you give me an example of something for a moving average in your use case? Or I could just give you a generic function as an example on implementing a moving average. – Haris Nadeem Jun 11 '18 at 14:43
  • And if you use 19 minutes, on average you will have 1 element in your list, but at most you will have 2 elements in your window. 1/5th of the time you will have two elements and 4/5th of the time you will have one element in the window. If you want, I can explain that logic. – Haris Nadeem Jun 11 '18 at 14:46
  • an example for my calculation of moving average: [link](https://docs.google.com/spreadsheets/d/1LILMxv4OT5t9mGwhOI_57WRptm22k03EO8oZxZH4yMI/edit?usp=sharing). Basically my average will be based on the sum of last 4 or 5 values (divided by 4 or 5). I say "4 or 5" because sometimes I get 4 distinct values within 1 hour, sometimes I get 5. So I cannot set my window to fire after X number of elements, rather I need to do it based on the elements times. – tyron Jun 11 '18 at 21:28
  • I didn't understand your calculations sadly. Let me know if you have more questions. – Haris Nadeem Jun 12 '18 at 05:54
  • @tyron I saw that I got the reward points for the question, so thanks! But that being said, was your use case resolved from this? If not let me know, I'd love to help you hash it out if I can be of help! If this worked out then great! :) – Haris Nadeem Jun 15 '18 at 02:42
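
For completeness, the "last hour" moving average described in the comments (averaging whichever 4 or 5 data points fall within the past 60 minutes by event time, rather than firing on an element count) can be sketched in plain Java:

```java
import java.util.List;

public class MovingAverage {

    // Average of the values whose timestamps fall within the last windowMin
    // minutes, measured back from nowMin. Depending on how the 15-minute
    // cadence drifts, that can be 4 or 5 points.
    static double lastHourAverage(List<long[]> points, long nowMin, long windowMin) {
        long sum = 0;
        int count = 0;
        for (long[] p : points) {            // p[0] = timestamp (minutes), p[1] = value
            if (p[0] > nowMin - windowMin && p[0] <= nowMin) {
                sum += p[1];
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        List<long[]> pts = List.of(
            new long[]{0, 10}, new long[]{15, 20},
            new long[]{30, 30}, new long[]{45, 40}, new long[]{60, 50});
        // 4 points fall in (0, 60]: 20 + 30 + 40 + 50 = 140, so 35.0
        System.out.println(lastHourAverage(pts, 60, 60));
    }
}
```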