
I have an Onyx stream of segments that are messages with a timestamp (arriving in chronological order). Say they look like this:

{:id 1 :timestamp "2018-09-04 13:15:42" :msg "Hello, World!"}
{:id 2 :timestamp "2018-09-04 21:32:03" :msg "Lorem ipsum"}
{:id 3 :timestamp "2018-09-05 03:01:52" :msg "Dolor sit amet"}
{:id 4 :timestamp "2018-09-05 09:28:16" :msg "Consetetur sadipscing"}
{:id 5 :timestamp "2018-09-05 12:45:33" :msg "Elitr sed diam"}
{:id 6 :timestamp "2018-09-06 08:14:29" :msg "Nonumy eirmod"}
...

For each time window (of one day) in the data, I want to run a computation on the set of all its segments. I.e., in the example, I would want to operate on the segments with ids 1 and 2 (for Sept 4th), next on the ids 3, 4 and 5 (for Sept 5th), and so on.

Onyx offers windows and triggers, and they should do what I want out of the box. If I use a window of :window/type :fixed and aggregate over :window/range [1 :day] with respect to :window/window-key :timestamp, I will aggregate all segments of each day.
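For reference, the window entry I have in mind looks roughly like this (the :window/id, :window/task and the aggregation are placeholders; also, as far as I know, Onyx expects the value under :window/window-key to be a comparable time such as epoch milliseconds or a java.util.Date, not a string like in the example above):

{:window/id :daily-messages
 :window/task :aggregate-msgs
 :window/type :fixed
 :window/aggregation :onyx.windowing.aggregation/conj
 :window/window-key :timestamp
 :window/range [1 :day]}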

To only trigger my computations when all segments of a day have arrived, Onyx offers the trigger behaviour :onyx.triggers/watermark. According to the documentation, it should fire

if the value of :window/window-key in the segment exceeds the upper-bound in the extent of an active window

However, the trigger does not fire, even though I can see that later segments are already coming in and several windows should be full. As a sanity check, I tried a simple :onyx.triggers/segment trigger, which worked as expected.
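The watermark trigger I am using looks roughly like this (the ids and the sync function are placeholders):

{:trigger/window-id :daily-messages
 :trigger/id :fire-on-watermark
 :trigger/on :onyx.triggers/watermark
 :trigger/sync ::process-day!}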


My failed attempt at creating a minimal example:

I modified the fixed windows toy job to test watermark triggering, and it worked there.

However, I found out that the reason the watermark trigger fires in this toy job might be:

Did it close the input channel? Maybe the job just completed which can trigger the watermark too.


Another aspect that interacts with watermark triggering is how work on tasks is distributed across peers.

The comments on issue #839 (:trigger/emit not working with :onyx.triggers/watermark) in the Onyx repo pointed me to issue #840 (Watermark doesn't work with Kafka topic having > 1 partition), where I found this clue (emphasis mine):

The problem is that all of your data is ending up on one partition, and the watermarks always takes the minimum watermark over all of the input peers (and if using the native kafka watermarks, the minimum watermark for a given peer).

As you call g/send with small amounts of data, and auto partition assignment, all of your data is ending up on one partition, meaning that the other partition's peer continues emitting a watermark of 0.


I found out that:

It’s impossible to use it with the current watermark trigger, which relies on the input source. You could try to pull the previous watermark implementation [...]

In my task graph, however, the segments I want to aggregate in windows are only created in some intermediate task; they don't originate from the input task as such. The input segments only provide the information that the intermediate task needs to create/retrieve the content of those segments.
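Schematically, the workflow looks something like this (task names are made up), with the windowed task sitting behind an intermediate task rather than directly behind the input:

[[:in :expand-msgs]             ;; :in only carries instructions
 [:expand-msgs :aggregate-msgs] ;; the message segments are created here
 [:aggregate-msgs :out]]        ;; windows/triggers are attached to :aggregate-msgs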

Again, this construct works fine in the above-mentioned toy job. The reason is that the input channel is closed at some point, which ends the job, which in turn triggers the watermark. So my toy example is actually not a good model, because it is not an open-ended stream.

If a job does get the segments in question from an actual input source, but without timestamps, Onyx seems to provide room to specify an assign-watermark-fn, which is an optional attribute of an input task. That function sets the watermark on the arrival of each new segment. In my case, this does not help, since the segments do not originate from an input task.
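For illustration, such an input task could look roughly like this; the plugin and function names are placeholders, and I am assuming the function receives a segment and returns the watermark as epoch milliseconds:

{:onyx/name :in
 :onyx/plugin :onyx.plugin.core-async/input
 :onyx/type :input
 :onyx/medium :core.async
 :onyx/max-peers 1
 :onyx/batch-size 10
 :onyx/assign-watermark-fn ::watermark-from-segment}

(defn watermark-from-segment [segment]
  ;; assumption: the segment carries its event time as a long under :timestamp-ms
  (:timestamp-ms segment))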


1 Answer


I have now come up with a work-around myself. The documentation basically gives a clue as to how this can be done:

This is a shortcut function for a punctuation trigger that fires when any piece of data has a time-based window key that is above another extent, effectively declaring that no more data for earlier windows will be arriving.

So I changed the task that emits the segments so that, for every segment, it also emits another "sentinel"-like segment:

[{:id 1 :timestamp "2018-09-04 13:15:42" :msg "Hello, World!"}
{:timestamp "2018-09-03 13:15:42" :over :out}]

Note that the sentinel's :timestamp is set back by one window range (here, 1 day), so it is assigned to the previous window. Since my data comes in chronologically, a :punctuation trigger can tell from the presence of a "sentinel" segment (with the keyword :over) that the window can be closed. Don't forget to evict (i.e., :trigger/post-evictor [:all]) and to throw away the "sentinel" segment from the final window contents. Adding :onyx/max-peers 1 to the task map makes sure that a sentinel always arrives eventually, especially when using grouping.
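A minimal sketch of the pieces involved, assuming the :timestamp is in epoch milliseconds and with all names made up; the exact arity of the :trigger/pred function should be checked against your Onyx version:

(def one-day-ms (* 24 60 60 1000))

;; The emitting task returns the real segment plus a sentinel whose
;; :timestamp is set back by one window range, so it lands in the previous window.
(defn with-sentinel [segment]
  [segment
   {:timestamp (- (:timestamp segment) one-day-ms) :over :out}])

;; Predicate for the punctuation trigger: fire as soon as a sentinel shows up.
;; (Assumed here to receive the trigger map and a state-event map holding the :segment.)
(defn sentinel? [trigger state-event]
  (contains? (:segment state-event) :over))

{:trigger/window-id :daily-messages
 :trigger/id :close-day
 :trigger/on :onyx.triggers/punctuation
 :trigger/pred ::sentinel?
 :trigger/post-evictor [:all]
 :trigger/sync ::process-day!} ;; drop the sentinel from the window contents here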

Note that two assumptions go into this work-around:

  1. The data comes in chronological order
  2. There are no windows without segments