I have an Onyx stream of segments that are messages with a timestamp (arriving in chronological order). Say they look like this:
{:id 1 :timestamp "2018-09-04 13:15:42" :msg "Hello, World!"}
{:id 2 :timestamp "2018-09-04 21:32:03" :msg "Lorem ipsum"}
{:id 3 :timestamp "2018-09-05 03:01:52" :msg "Dolor sit amet"}
{:id 4 :timestamp "2018-09-05 09:28:16" :msg "Consetetur sadipscing"}
{:id 5 :timestamp "2018-09-05 12:45:33" :msg "Elitr sed diam"}
{:id 6 :timestamp "2018-09-06 08:14:29" :msg "Nonumy eirmod"}
...
For each time window (of one day) in the data, I want to run a computation on the set of all its segments. I.e., in the example, I would want to operate on the segments with ids 1 and 2 (for Sept 4th), next on the ids 3, 4 and 5 (for Sept 5th), and so on.
Onyx offers windows and triggers, and they should do what I want out of the box. If I use a window of :window/type :fixed and aggregate over :window/range [1 :day] with respect to :window/window-key :timestamp, I will aggregate all segments of each day.
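A window entry along these lines would express that. This is only a sketch: the :window/id, the :window/task name and the choice of the conj aggregation are placeholders for illustration, not taken from the actual job.

```clojure
;; Sketch of a fixed one-day window over the :timestamp key.
;; :daily-messages and :process-day are hypothetical names.
(def windows
  [{:window/id          :daily-messages
    :window/task        :process-day
    :window/type        :fixed
    ;; conj collects every segment of the window into a collection
    :window/aggregation :onyx.windowing.aggregation/conj
    :window/window-key  :timestamp
    :window/range       [1 :day]}])
```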
To only trigger my computations when all segments of a day have arrived, Onyx offers the trigger behaviour :onyx.triggers/watermark. According to the documentation, it should fire "if the value of :window/window-key in the segment exceeds the upper-bound in the extent of an active window".
However, the trigger does not fire, even though I can see that later segments are already coming in and several windows should be full. As a sanity check, I tried a simple :onyx.triggers/segment trigger, which worked as expected.
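For reference, the two trigger variants compared here might look roughly as follows. This is a sketch under assumptions: the ids, the window id and the sync function name are hypothetical, and depending on the Onyx version further keys (for example a refinement) may be required.

```clojure
;; Sketch only; :daily-messages and ::process-day! are hypothetical names.
(def triggers
  [;; Intended behaviour: fire once the watermark passes the end of a window.
   {:trigger/id        :day-complete
    :trigger/window-id :daily-messages
    :trigger/on        :onyx.triggers/watermark
    :trigger/sync      ::process-day!}
   ;; Sanity check: fire after every single segment.
   {:trigger/id        :every-segment
    :trigger/window-id :daily-messages
    :trigger/on        :onyx.triggers/segment
    :trigger/threshold [1 :elements]
    :trigger/sync      ::process-day!}])
```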
My failed attempt at creating a minimal example:
I modified the fixed windows toy job to test watermark triggering, and it worked there.
However, I found out that in this toy job, the reason the watermark trigger is firing might be:
Did it close the input channel? Maybe the job just completed which can trigger the watermark too.
Another aspect that interacts with watermark triggering is the distributed work on tasks by peers.
The comments to issue #839 (:trigger/emit not working with :onyx.triggers/watermark) in the Onyx repo pointed me to issue #840 (Watermark doesn't work with Kafka topic having > 1 partition), where I found this clue (emphasis mine):
The problem is that all of your data is ending up on one partition, and the watermarks always takes the minimum watermark over all of the input peers (and if using the native kafka watermarks, the minimum watermark for a given peer).
As you call g/send with small amounts of data, and auto partition assignment, all of your data is ending up on one partition, meaning that the other partition's peer continues emitting a watermark of 0.
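The effect described in that comment can be illustrated with a toy model. This is not Onyx's actual implementation, just a minimal sketch of "take the minimum over all input peers"; the function name and the partition keys are made up for the illustration.

```clojure
;; Toy model: the effective watermark is the minimum of the watermarks
;; reported by all input peers (epoch milliseconds).
(defn job-watermark
  [peer-watermarks]
  (apply min (vals peer-watermarks)))

;; One partition has seen data up to 2018-09-06, the other has seen nothing
;; and still reports 0, so the overall watermark stays at 0.
(job-watermark {:partition-0 1536192000000
                :partition-1 0})
;; => 0
```

As long as one peer keeps reporting 0, no window's upper bound is ever exceeded, so a watermark trigger has no reason to fire.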
I found out that:
It’s impossible to use it with the current watermark trigger, which relies on the input source. You could try to pull the previous watermark implementation [...]
In my task graph, however, the segments I want to aggregate in windows are only created in an intermediate task; they don't originate from the input task as such. The input segments only tell that intermediate task how to create or retrieve the content of those segments.
Again, this construct works fine in the above-mentioned toy job. The reason is that the input channel is closed at some point, which ends the job, which in turn triggers the watermark. So my toy example is actually not a good model, because it is not an open-ended stream.
If a job does get the segments in question from an actual input source, but without timestamps, Onyx seems to provide room to specify an assign-watermark-fn, which is an optional attribute of an input task. That function sets the watermark on each arrival of a new segment. In my case, this does not help, since the segments do not originate from an input task.
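For jobs where the segments do come straight from an input task, such a function might look like the sketch below. It assumes that the function receives a segment and returns epoch milliseconds, that the catalog key is spelled :onyx/assign-watermark-fn, and that the timestamps are in UTC; all three should be checked against the information model of the Onyx version in use. The namespace, the function name and the task name are hypothetical.

```clojure
(ns example.watermarks
  (:import (java.time LocalDateTime ZoneOffset)
           (java.time.format DateTimeFormatter)))

;; Matches timestamps such as "2018-09-04 13:15:42".
(def ts-format (DateTimeFormatter/ofPattern "yyyy-MM-dd HH:mm:ss"))

(defn segment->watermark
  "Returns the segment's :timestamp as epoch milliseconds, assuming UTC."
  [segment]
  (-> (LocalDateTime/parse (:timestamp segment) ts-format)
      (.toInstant ZoneOffset/UTC)
      (.toEpochMilli)))

;; Fragment of an input task's catalog entry; plugin-specific keys omitted.
;; The key name is an assumption, see the note above.
(def input-task-fragment
  {:onyx/name :in
   :onyx/type :input
   :onyx/assign-watermark-fn ::segment->watermark})
```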