
I would like to create a Kafka Streams-based application that processes a topic and takes messages in batches of size X (e.g. 50), but if the stream has low traffic, gives me whatever the stream has within Y seconds (e.g. 5).

So, instead of processing messages one by one, I process a List[Record] where the size of the list is 50 (or maybe less).

This is to make some I/O bound processing more efficient.

I know that this can be implemented with the classic Kafka API, but I was looking for a stream-based implementation that can also handle offset committing natively, taking errors/failures into account. I couldn't find anything related in the docs or by searching around, and was wondering if anyone has a solution to this problem.

  • an equivalent functionality would be the akka groupedWithin stream function https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html – jimkont Dec 07 '18 at 07:49
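
For reference, a minimal sketch of the groupedWithin behaviour mentioned in the comment above (Akka Streams Java DSL, Akka 2.5+; illustrative only, not a Kafka Streams solution; imports omitted as elsewhere in this post):

ActorSystem system = ActorSystem.create("grouped-within-demo");
Materializer materializer = ActorMaterializer.create(system);

Source.range(1, 1000)
        // emit lists of up to 50 elements, or whatever arrived within 5 seconds
        .groupedWithin(50, Duration.ofSeconds(5))
        .runForeach(batch -> System.out.println("batch of " + batch.size()), materializer);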

3 Answers


@Matthias J. Sax's answer is nice; I just want to add an example for it, as I think it might be useful for someone. Let's say we want to combine incoming values into the following type:

@Getter @Builder(toBuilder = true)  // Lombok annotations assumed, for the builder()/getValues() calls used below
public class MultipleValues { private List<String> values; }

To collect messages into batches with a maximum size, we need to create a transformer:

public class MultipleValuesTransformer implements Transformer<String, String, KeyValue<String, MultipleValues>> {
    private ProcessorContext processorContext;
    private String stateStoreName;
    private KeyValueStore<String, MultipleValues> keyValueStore;
    private Cancellable scheduledPunctuator;

    public MultipleValuesTransformer(String stateStoreName) {
        this.stateStoreName = stateStoreName;
    }

    @Override
    public void init(ProcessorContext processorContext) {
        this.processorContext = processorContext;
        this.keyValueStore = (KeyValueStore<String, MultipleValues>) processorContext.getStateStore(stateStoreName);
        scheduledPunctuator = processorContext.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::doPunctuate);
    }

    @Override
    public KeyValue<String, MultipleValues> transform(String key, String value) {
        MultipleValues itemValueFromStore = keyValueStore.get(key);
        if (isNull(itemValueFromStore)) {
            itemValueFromStore = MultipleValues.builder().values(Collections.singletonList(value)).build();
        } else {
            List<String> values = new ArrayList<>(itemValueFromStore.getValues());
            values.add(value);
            itemValueFromStore = itemValueFromStore.toBuilder()
                    .values(values)
                    .build();
        }
        if (itemValueFromStore.getValues().size() >= 50) {
            processorContext.forward(key, itemValueFromStore);
            keyValueStore.put(key, null);
        } else {
            keyValueStore.put(key, itemValueFromStore);
        }
        return null;
    }

    private void doPunctuate(long timestamp) {
        // close the store iterator when done (it may hold resources for persistent stores)
        try (KeyValueIterator<String, MultipleValues> valuesIterator = keyValueStore.all()) {
            while (valuesIterator.hasNext()) {
                KeyValue<String, MultipleValues> keyValue = valuesIterator.next();
                if (nonNull(keyValue.value)) {
                    processorContext.forward(keyValue.key, keyValue.value);
                    keyValueStore.put(keyValue.key, null);
                }
            }
        }
    }

    @Override
    public void close() {
        scheduledPunctuator.cancel();
    }
}

Then we need to create the key-value store, add it to the StreamsBuilder, and build the KStream flow using the transform method:

Properties props = new Properties();
...
Serde<MultipleValues> multipleValuesSerde = Serdes.serdeFrom(new JsonSerializer<>(), new JsonDeserializer<>(MultipleValues.class));
StreamsBuilder builder = new StreamsBuilder();
String storeName = "multipleValuesStore";
KeyValueBytesStoreSupplier storeSupplier = Stores.persistentKeyValueStore(storeName);
StoreBuilder<KeyValueStore<String, MultipleValues>> storeBuilder =
        Stores.keyValueStoreBuilder(storeSupplier, Serdes.String(), multipleValuesSerde);
builder.addStateStore(storeBuilder);

builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
        .transform(() -> new MultipleValuesTransformer(storeName), storeName)
        .print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();

With this approach we aggregate by the incoming key. If you need to collect messages not by key but by some of the message's fields, you need the following flow to trigger repartitioning of the KStream (by using an intermediate topic):

.selectKey(..)
.through(intermediateTopicName)
.transform( ..)
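
For instance, a minimal sketch of that flow (the intermediate topic name is illustrative, and extractField is a hypothetical helper that derives the grouping field from the value):

builder.stream("source", Consumed.with(Serdes.String(), Serdes.String()))
        .selectKey((key, value) -> extractField(value))   // hypothetical helper deriving the grouping field
        .through("repartition-by-field", Produced.with(Serdes.String(), Serdes.String()))
        .transform(() -> new MultipleValuesTransformer(storeName), storeName)
        .print(Printed.<String, MultipleValues>toSysOut().withLabel("transformedMultipleValues"));

On newer Kafka Streams versions, the deprecated through() can be replaced with repartition().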
Vasyl Sarzhynskyi
  • Do you think there would be any performance improvement in using this to process kafka streams records? – blake_griffin Feb 04 '22 at 04:49
  • I haven't measured performance of this approach in comparison with other alternatives – Vasyl Sarzhynskyi Feb 06 '22 at 14:40
  • @VasylSarzhynskyi In doPunctuate don't you need to check the time? Let's assume punctuate is called at times 00:00:00, 00:00:30, 00:01:00 etc and a message arrives on 00:00:28, then at 00:00:30 it will be forwarded even though it has only been 2 seconds since it arrived – mich8bsp Jun 28 '22 at 12:19

The simplest way might be to use a stateful transform() operation. Each time you receive a record, you put it into the store. When you have received 50 records, you do your processing, emit output, and delete the records from the store.

To enforce processing if you don't reach the limit in a certain amount of time, you can register a wall-clock punctuation.
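
For reference, a minimal sketch of the store wiring for this approach, using the in-memory, change-logged variant discussed in the comments below (store name and serdes are illustrative):

StoreBuilder<KeyValueStore<String, String>> batchStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("batch-buffer"),
                Serdes.String(), Serdes.String())
        .withLoggingEnabled(Collections.emptyMap());   // backed by a changelog topic
builder.addStateStore(batchStoreBuilder);
// a Transformer attached via .transform(supplier, "batch-buffer") can then buffer records
// and flush incomplete batches from a wall-clock punctuator, as in the example in the other answer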

Matthias J. Sax
  • Sounds like a straightforward workaround but probably still involves some I/O for maintaining state in the store. Definitely worth testing – jimkont Dec 07 '18 at 07:52
  • For small state like this, you can keep it in memory, too. Only make sure it gets logged to a changelog topic -- but this overhead should be minimal if you enable caching. – Matthias J. Sax Dec 07 '18 at 22:51
  • (maybe newbie) follow up question for the in-memory store: should the store name be unique per consumer or unique per consumer group (application-id)? asking for cases of error or re-partitioning due to consumer scaling – jimkont Dec 10 '18 at 12:40
  • Store name is unique per consumer group / application.id – Matthias J. Sax Dec 11 '18 at 07:41
  • thanks, what is not easy to understand is if this in-memory-store will be shared with all application consumers. i.e. the messages that will be temporarily stored for grouping by consumer A, will it affect consumer B that is running in parallel and is also storing messages for grouping? e.g. the 50 grouped messages are written from a single consumer or multiple ones and do I need to take care of race conditions during checking size for processing? – jimkont Dec 11 '18 at 08:25
  • Stores are sharded. Cf. https://stackoverflow.com/questions/40274884/is-kafka-stream-statestore-global-over-all-instances-or-just-local – Matthias J. Sax Dec 11 '18 at 11:47
  • Wait, don't you want the state store to be unique per partition ? i.e. each partition gets its own store. – blake_griffin Feb 07 '22 at 08:31
  • Yes, that is how it's done. Why do you think differently? (To be more precise: don't mix up whether we discuss the logical vs. physical level -- a single logical state store is physically sharded.) Logically a state store can be shared across Processors, implying that each physical shard is shared between the two Processor instantiations that process the same input topic partition. The shards are still isolated. – Matthias J. Sax Feb 07 '22 at 17:38

It seems that there is no need to use Processors or Transformers and transform() to batch events by count. Regular groupBy() and reduce()/aggregate() should do the trick:

KeyValueSerde keyValueSerde = new KeyValueSerde();  // simple custom Serde
final AtomicLong batchCount = new AtomicLong(0L);
myKStream
    .groupBy((k,v) -> KeyValue.pair(k, batchCount.getAndIncrement() / batchSize),
        Grouped.keySerde(keyValueSerde))
    .reduce(this::windowReducer)     // <-- how you want to aggregate values in batch
    .toStream()
    .filter((k,v) -> /* pass through full batches only */)
    .selectKey((k,v) -> k.key)
    ...

You'd also need to add a straightforward Serde for the standard KeyValue<String, Long>.
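
For example, a minimal sketch of such a Serde (assuming a Kafka clients version where Serializer/Deserializer can be written as lambdas; the "count|key" encoding is just an illustration):

Serde<KeyValue<String, Long>> keyValueSerde = Serdes.serdeFrom(
        (topic, kv) -> (kv.value + "|" + kv.key).getBytes(StandardCharsets.UTF_8),
        (topic, bytes) -> {
            String s = new String(bytes, StandardCharsets.UTF_8);
            int sep = s.indexOf('|');
            return KeyValue.pair(s.substring(sep + 1), Long.parseLong(s.substring(0, sep)));
        });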

This option is obviously only helpful when you don't need a "punctuator" to emit incomplete batches on timeout. It also doesn't guarantee the order of elements in the batch in case of distributed processing.

You can also concatenate the count to the key string to form the new key (instead of using KeyValue). That would simplify the example even further (to using Serdes.String()).
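
A sketch of that variant, in the same fragment style as above (the '#' separator and valueSerde are illustrative):

myKStream
    .groupBy((k,v) -> k + "#" + (batchCount.getAndIncrement() / batchSize),
        Grouped.with(Serdes.String(), valueSerde))   // valueSerde: whatever your values use
    .reduce(this::windowReducer)
    .toStream()
    .filter((k,v) -> /* pass through full batches only */)
    .selectKey((k,v) -> k.substring(0, k.lastIndexOf('#')))   // strip the batch counter
    ...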

Sergey Shcherbakov
  • 1. Where do you specify the batch size here? 2. How do you then get the batched elements from this resultant stream? – blake_griffin Feb 01 '22 at 02:56
  • 1. By setting the "batchSize" variable and verifying in the filter section that you only return true when the batch has the desired count. 2. You'd need to extend your Value type to contain a set or list for the batch elements and add to that list in the reducer. – Sergey Shcherbakov Feb 01 '22 at 19:18