Events timeline

I am gathering information about flights. The maximum length of a flight is 10 h, and I receive tracking information roughly once a minute. The order of events gets disturbed during processing in Apache Beam. After merging all the data for a flight, I want to push it to BigQuery and discard it so that it doesn't consume memory.

I have two strategies for doing this:

1) Wait 1 h and, if no new data arrives, push it to BQ.

2) Every 15 minutes, run my own algorithm that verifies whether the data is complete.

I want to go with 1) because it's simpler. Is my code correct?

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime

models = xmls | beam.FlatMap(process_xmls)
tracking_informations = models | beam.ParDo(FilterTI())
grouped_tis = tracking_informations | beam.WindowInto(window.FixedWindows(10 * 3600), trigger=AfterProcessingTime(1 * 3600), accumulation_mode=AccumulationMode.DISCARDING) | beam.GroupByKey()  # next step: "push and merge to BQ"
sacherus

1 Answer

After reading your use case and the desired approach (grouping all events belonging to the same flight together until you find a gap of inactivity), this seems like a perfect fit for Session windows. In the example, you should use the flight identifier (f1, f2 and so on) as the key and specify a gap of 1 hour. If no new events are observed for a key during that time, its session is terminated.

You can use them with beam.WindowInto(window.Sessions(session_gap)), and you can find a full example here (don't forget to add the GroupByKey step to actually merge the events together into a single session).
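To make the gap semantics concrete, here is a minimal plain-Python sketch of how events are grouped into per-key sessions. It is deliberately not Beam code: the function name `group_into_sessions`, the `(key, timestamp, payload)` event shape, and the sample data are assumptions made up for illustration, and it only mimics, in batch style, what a session window followed by a group-by-key does conceptually.

```python
from collections import defaultdict

SESSION_GAP = 3600  # seconds; the 1-hour inactivity gap from the question


def group_into_sessions(events, gap=SESSION_GAP):
    """Group (key, timestamp, payload) events into per-key sessions.

    A new session starts whenever the time since the previous event
    for the same key is at least `gap` seconds.
    """
    # Bucket events by key (hypothetical flight identifier).
    by_key = defaultdict(list)
    for key, ts, payload in events:
        by_key[key].append((ts, payload))

    sessions = {}
    for key, items in by_key.items():
        # Sort by timestamp so out-of-order arrival does not matter.
        items.sort(key=lambda item: item[0])
        key_sessions = []
        current = [items[0]]
        for prev, item in zip(items, items[1:]):
            if item[0] - prev[0] >= gap:
                # Inactivity gap reached: close the current session.
                key_sessions.append(current)
                current = []
            current.append(item)
        key_sessions.append(current)
        sessions[key] = key_sessions
    return sessions


# Made-up sample: f1 arrives out of order and then goes quiet for over 1 h.
events = [
    ("f1", 0, "a"),
    ("f2", 30, "x"),
    ("f1", 120, "c"),
    ("f1", 60, "b"),
    ("f1", 4000, "d"),
]
result = group_into_sessions(events)
```

With this sample, the first three f1 events end up in one session and the event at 4000 s starts a second one, while f2 gets its own single session. In a streaming Beam pipeline the runner additionally has to decide when a session can be considered closed, which this batch-style sketch ignores.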

Guillem Xercavins