
My application receives events that I need to group by key. Some of them can be expected to arrive out of order, up to 2 days after the 'event' time. I don't want to aggregate the records; I simply want an ordered list of those events, one time window at a time. I doubt that using an aggregation function to build a List of events will work: there are likely to be thousands of events per key/time window, so I am very likely to hit a RecordTooLargeException.

The code I played with creates a tumbling window with a 2 day grace period. It works in principle, but it requires an aggregation, and I feel that building a list of messages goes beyond what the aggregate function was originally intended for, e.g.

// Tumbling windows with a grace period for late-arriving events
stream.stream[EntryKey, Entry](inputTopic)(Consumed.`with`[EntryKey, Entry](timestampExtractor))
      .groupByKey(Grouped.`with`(keySerde, valueSerde))
      .windowedBy(TimeWindows.of(windowSize).grace(windowGrace))
      // Collect every event of the window into a single EntryGroup value
      .aggregate[EntryGroup](
        EntryGroup(Seq.empty[Entry])
      )((_: EntryKey, newValue: Entry, aggregate: EntryGroup) => EntryGroup(aggregate.anprEntries :+ newValue))
      // Emit only the final result once the window has closed
      .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))

Is there an idiomatic way of passing downstream ordered groups of events for a key/timewindow without running into record size issues?

1 Answer

I would recommend falling back to the Processor API: using a WindowStore (with `retainDuplicates` enabled) allows you to buffer all records (one record per "row" in the store).
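As a rough illustration of the store setup, such a store could be declared along the following lines. This is only a sketch: the store name, the one-hour window size, the three-day retention (chosen to exceed the question's two-day grace period) and the String serdes are all assumptions, not values from the question.

```scala
import java.time.Duration
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.state.{StoreBuilder, Stores, WindowStore}

object EntryBufferStore {
  // Hypothetical names and sizes, chosen only for illustration.
  val storeName: String = "entry-buffer"
  val windowSize: Duration = Duration.ofHours(1)
  // Retention must cover the window plus the 2 day grace period,
  // otherwise rows may be purged before they have been emitted.
  val retention: Duration = Duration.ofDays(3)

  // retainDuplicates = true keeps every record as its own row
  // instead of overwriting earlier values for the same key and timestamp.
  val builder: StoreBuilder[WindowStore[String, String]] =
    Stores.windowStoreBuilder(
      Stores.persistentWindowStore(storeName, retention, windowSize, true),
      Serdes.String(),
      Serdes.String()
    )
}
```

The builder would then be registered on the topology and the store connected to the processor that buffers the records.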

Thus you can just put incoming records into the store and emit "old records" from the store (and delete them from the store) when time progresses.
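The buffering logic itself can be pictured without any Kafka dependency. The following is a minimal, library-free model, not Kafka Streams code: `Event`, `GraceBuffer` and their fields are hypothetical names, timestamps are assumed to be non-negative milliseconds, and "time progresses" is modelled as the highest event timestamp seen so far.

```scala
import scala.collection.mutable

// Hypothetical event type: a key, an event timestamp (ms) and a payload.
final case class Event(key: String, timestamp: Long, payload: String)

// Buffers events per (key, windowStart) and releases a window once
// windowEnd + grace has been reached by the observed stream time.
final class GraceBuffer(windowSizeMs: Long, graceMs: Long) {
  private val buffer = mutable.Map.empty[(String, Long), Vector[Event]]
  private var streamTime: Long = Long.MinValue

  // A window is closed once its end plus the grace period is not after stream time.
  private def isClosed(windowStart: Long): Boolean =
    windowStart + windowSizeMs + graceMs <= streamTime

  /** Adds an event and returns every window that is now closed,
    * each with its events sorted by timestamp. */
  def add(event: Event): Seq[((String, Long), Vector[Event])] = {
    streamTime = math.max(streamTime, event.timestamp)
    val windowStart = event.timestamp - (event.timestamp % windowSizeMs)

    // Events for an already closed window are dropped as too late.
    if (!isClosed(windowStart)) {
      val id = (event.key, windowStart)
      buffer(id) = buffer.getOrElse(id, Vector.empty) :+ event
    }

    // Emit closed windows in (windowStart, key) order and delete them from the buffer.
    val closed = buffer.keys.filter { case (_, start) => isClosed(start) }.toVector.sortBy(_.swap)
    closed.map(id => id -> buffer.remove(id).get.sortBy(_.timestamp))
  }
}
```

Unlike this model, which keeps each window's events together in memory, the store-based approach keeps one row per record, so no single value grows with the number of events in a window.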

Matthias J. Sax
  • Thanks @matthias-j-sax. I presume we would create a WindowStore with an arbitrarily large retentionPeriod to ensure values are not purged before the grace period expires - i.e. in my use case 2 days. Then use a punctuator to iterate the store and emit records that have reached the grace period in age? – Nicholas Lester Mar 01 '21 at 20:12
  • Yes, that should work. -- Using punctuations or not might be a design decision: you could also emit expired data each time you add a new record within `process()`. – Matthias J. Sax Mar 01 '21 at 21:38
  • Perfect. I will put a null value into the state store for the key and timestamp in question once it has been emitted. Looks like that is as good as a delete. Thanks so much! – Nicholas Lester Mar 02 '21 at 08:54
  • Hi @matthias-j-sax. I've come back to complete this, and the difficulty I have observed when using a WindowStore with retainDuplicates enabled is that I don't think there is a clear way to remove a key/value. Where I would normally put a null value for a given key, the javadoc for `Stores.persistentWindowStore` seems to suggest this is not possible (and I'd have to delete it from the store once I have emitted the value) - "retainDuplicates - whether or not to retain duplicates. Turning this on will automatically disable caching and means that null values will be ignored." – Nicholas Lester May 12 '21 at 11:04
  • Ah. This seems to be a bug -- we recently hit it ourselves while working on https://issues.apache.org/jira/browse/KAFKA-10847 -- In the DSL, windowed stores never call `delete()` but rely only on retention time, thus the bug was hidden for a long time... It should be fixed in the upcoming 3.0 release (cf https://github.com/apache/kafka/pull/10537) – Matthias J. Sax May 12 '21 at 20:32