The following discussion is in the context of Apache Flink:

Imagine that we have a KeyedStream whose key is the event's id and whose event time is the event's timestamp. For each event, we want to calculate how many events arrived within 10 minutes of it.

The problems that need to be solved are:

  1. How should the window be designed?
    • We can create a 10-minute window after each event arrives, but this means that every event is delayed by 10 minutes, because we have to wait for its window to close.
    • We can create a 10-minute window that takes the timestamp of each event as the maximum timestamp in the window. Then we don't need to wait for 10 minutes, because we look at the elements from the 10 minutes before the element arrives. But as far as I know, this kind of window is not easy to define.
  2. How should memory and other resource issues be handled? Even if we succeed in creating such a window, the event ids may be very diverse, so there will be many windows like this. How does the system keep their state in memory? There is a real risk of running out of memory.

Maybe there are some problems that I haven't mentioned here, or maybe there are good solutions other than windows (e.g. patterns). If you have a good solution, please give me a clue. Thank you.

Leyla Lee

1 Answer

You could do this with a GlobalWindow, a Trigger that fires on every event, and an Evictor that removes events that are more than 10 minutes old before counting the remaining events. (A naive implementation could easily perform very poorly, however.)
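To make the logic concrete, here is a minimal plain-Java sketch of what such a trigger and evictor would compute per key. It deliberately does not use the Flink API: the class `SlidingCounter` and its `add` method are hypothetical names invented for illustration, string keys are an assumption, and the sketch assumes that timestamps arrive in non-decreasing order for each key (out-of-order events would need extra handling).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical plain-Java model of "on every event, count the events of the last N millis per key". */
final class SlidingCounter {
    private final long windowMillis;
    // One deque of timestamps per key; only timestamps are kept, not whole events.
    private final Map<String, Deque<Long>> timestampsByKey = new HashMap<>();

    SlidingCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Registers one event and returns how many events with the same key,
     * including this one, are at most windowMillis old relative to it.
     * Assumes non-decreasing timestamps per key.
     */
    int add(String key, long timestamp) {
        Deque<Long> timestamps = timestampsByKey.computeIfAbsent(key, k -> new ArrayDeque<>());
        timestamps.addLast(timestamp);
        // "Evictor" step: drop timestamps that are more than one window older than the new event.
        long cutoff = timestamp - windowMillis;
        while (!timestamps.isEmpty() && timestamps.peekFirst() < cutoff) {
            timestamps.removeFirst();
        }
        // "Trigger" step: a result is produced for every incoming event.
        return timestamps.size();
    }

    public static void main(String[] args) {
        SlidingCounter counter = new SlidingCounter(10 * 60 * 1000L);
        System.out.println(counter.add("sensor-1", 0L));
        System.out.println(counter.add("sensor-1", 5 * 60 * 1000L));
        System.out.println(counter.add("sensor-1", 16 * 60 * 1000L));
    }
}
```

Because old timestamps are removed from the front of a deque as new ones arrive, each timestamp is added once and removed once, which avoids rescanning the whole window on every event. Rescanning is one way a naive implementation could become slow.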

Yes, this may require keeping a lot of state -- you'll be keeping every event from the past 10 minutes (well, you only need to store the timestamp from each event). If you set up the RocksDB state backend, then Flink will spill to disk if need be, but with some obvious performance penalty. It is probably better to use a cluster large enough to hold 10 minutes of traffic in memory. Even at one million events per second, each with a 32-bit timestamp, that's only 2.4GB in 10 minutes (1 million events per second x 600 seconds x 4 bytes per event) -- doesn't seem like a problem at all.
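The back-of-the-envelope estimate above can be reproduced with a few lines of Java. This is only a rough sketch: the class `StateSizeEstimate` and its `bytes` method are hypothetical names, and the result covers the raw timestamp payload only, ignoring per-object and state-backend overhead. The 8-byte variant is an added assumption for the case where timestamps are stored as 64-bit `long` values, which simply doubles the figure.

```java
/** Hypothetical helper that reproduces the raw state-size estimate from the answer. */
final class StateSizeEstimate {

    /** Raw bytes needed to keep one timestamp per event for the whole window. */
    static long bytes(long eventsPerSecond, long windowSeconds, long bytesPerTimestamp) {
        return eventsPerSecond * windowSeconds * bytesPerTimestamp;
    }

    public static void main(String[] args) {
        // Figures from the answer: 1 million events/s, 600 s, 4 bytes per timestamp.
        long fourByte = bytes(1_000_000L, 600L, 4L);
        // Assumption for comparison: 8-byte (64-bit) timestamps.
        long eightByte = bytes(1_000_000L, 600L, 8L);
        System.out.println("4-byte timestamps: " + fourByte + " bytes");
        System.out.println("8-byte timestamps: " + eightByte + " bytes");
    }
}
```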

David Anderson
  • Hey, thank you for your reply. I think a sliding window can easily solve this problem instead of a GlobalWindow, but it has an obvious flaw: if I use `SlidingWindow`, I will destroy the **content** of the stream. If I want to apply other calculations to the stream afterwards, I cannot get the original events back after the `SlidingWindow` – Leyla Lee Dec 05 '17 at 15:51
  • In that case, I suspect you'd be better off with the flexibility of a `ProcessFunction`. – David Anderson Dec 05 '17 at 19:49
  • Thank you for your reply. I have another question that maybe you can give me a clue about. In stream processing it is normal to use windows, but a window is a transformation of the data. If I want to keep working with the same events for other calculations, how can I do it? – Leyla Lee Dec 06 '17 at 09:27
  • Take a look at split() and side outputs (https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html). – David Anderson Dec 06 '17 at 10:11
  • Hi! I am wondering, do you have an email address? I would like to write you an email to discuss it. You gave me some good inspiration – Leyla Lee Dec 06 '17 at 10:51