10

For many situations in Big Data it is preferable to work with a small buffer of records at a go, rather than one record at a time.

The natural example is calling some external API that supports batching for efficiency.

How can we do this in Kafka Streams? I cannot find anything in the API that looks like what I want.

So far I have:

builder.stream[String, String]("my-input-topic")
  .mapValues(externalApiCall)
  .to("my-output-topic")

What I want is:

builder.stream[String, String]("my-input-topic")
  .batched(chunkSize = 2000)
  .map(externalBatchedApiCall)
  .to("my-output-topic")

In Scala and Akka Streams the function is called `grouped` or `batch`. In Spark Structured Streaming we can do something like `mapPartitions(_.grouped(2000).map(externalBatchedApiCall))`.

samthebest
  • Why not just schedule the processing which will then read up to chunkSize records from the stream? – daniu Sep 17 '18 at 12:01
  • A side note to your question: calling external APIs from a streams processor is not always the best pattern. Sometimes you'll find that the external data is best brought into Kafka itself (e.g. CDC from databases, mainframes, etc.) as its own topic, and then easily joined within the stream processing itself. – Robin Moffatt Sep 17 '18 at 12:11
  • mapPartitions in Spark doesn't guarantee partition size. Only the streaming duration can affect the window size – OneCricketeer Sep 17 '18 at 13:02
  • As @RobinMoffatt mentioned, it might be better to load the external data into a Kafka topic, read it as a KTable into your application, and do a stream-table join instead of an external API call. – Matthias J. Sax Sep 17 '18 at 16:23
  • Besides this, you could use `transform()` with an attached state store and build up the batches manually: if the state size is smaller than 200, put the record into the store; if you hit 200 records, extract all the data, do the external API call (note, you need to do it synchronously), and clear the store. (A sketch of this approach follows these comments.) – Matthias J. Sax Sep 17 '18 at 16:25
  • @MatthiasJ.Sax In my case, I have a state store like this: `KeyValueStore>`. On every punctuation I check the size of these lists, and any list larger than the threshold gets sent to the 3rd-party API. The question is: how do you clear the store securely while you have incoming data? – Alper Kanat Dec 18 '18 at 09:36
  • "How do you clear the store securely while you have incoming data?" -- not sure what you mean here? Can you elaborate? – Matthias J. Sax Dec 18 '18 at 10:57
  • My app reads from and writes to a state store (`KeyValueStore>`) and uses a WALL_CLOCK_TIME punctuator. Let's say every 10 secs I check the store for lists larger than the threshold and send them to a remote API. Then I reset the corresponding list and commit the state. While I do this, new messages keep coming in and the `process` method continues to write to the state store, maybe into the very list I'm working on. I guess this can't happen if I have a single thread; otherwise I'll have to add a locking mechanism, is that correct? – Alper Kanat Dec 18 '18 at 13:04
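
As a rough illustration of the `transform()`-with-state approach Matthias describes above, a minimal sketch in Java (not code from this thread) could look like the following. The store name, the use of `approximateNumEntries()` as the size check, and the shape of `externalBatchedApiCall` are assumptions for illustration; a real implementation would also schedule a punctuator to flush partial batches and deal with the commit and threading caveats discussed in the comments.

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class BatchTransformSketch {

    static final String STORE_NAME = "batch-store"; // hypothetical store name
    static final int BATCH_SIZE = 2000;

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("my-input-topic")
            .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                private ProcessorContext context;
                private KeyValueStore<String, String> store;

                @SuppressWarnings("unchecked")
                @Override
                public void init(ProcessorContext context) {
                    this.context = context;
                    this.store = (KeyValueStore<String, String>) context.getStateStore(STORE_NAME);
                    // A production version would also context.schedule() a punctuator here
                    // to flush batches that never reach BATCH_SIZE.
                }

                @Override
                public KeyValue<String, String> transform(String key, String value) {
                    store.put(key, value); // buffer the record (assumes keys are unique within a batch)
                    if (store.approximateNumEntries() < BATCH_SIZE) {
                        return null; // nothing to emit until the batch is full
                    }
                    // Drain the store, call the external API synchronously, forward the results.
                    List<KeyValue<String, String>> batch = new ArrayList<>();
                    try (KeyValueIterator<String, String> it = store.all()) {
                        it.forEachRemaining(batch::add);
                    }
                    for (KeyValue<String, String> result : externalBatchedApiCall(batch)) {
                        context.forward(result.key, result.value);
                    }
                    batch.forEach(kv -> store.delete(kv.key));
                    return null;
                }

                @Override
                public void close() { }
            }, STORE_NAME)
            .to("my-output-topic");

        return builder;
    }

    // Stand-in for the real batched API call.
    static List<KeyValue<String, String>> externalBatchedApiCall(List<KeyValue<String, String>> batch) {
        return batch; // identity; a real implementation would call the remote service
    }
}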

3 Answers

6

Doesn't seem to exist yet. Watch this space https://issues.apache.org/jira/browse/KAFKA-7432

samthebest
3

You could use a queue. Something like the below:

// AbstractStreamProcessor is the answerer's own base class (assumed to provide the
// `streamsBuilder` field); KafkaStreamsConfiguration is assumed to be spring-kafka's
// org.springframework.kafka.config.KafkaStreamsConfiguration.
import static org.apache.kafka.streams.processor.PunctuationType.WALL_CLOCK_TIME;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;

import org.springframework.kafka.config.KafkaStreamsConfiguration;
import org.springframework.stereotype.Component;

import lombok.extern.slf4j.Slf4j;

@Component
@Slf4j
public class NormalTopic1StreamProcessor extends AbstractStreamProcessor<String> {

    public NormalTopic1StreamProcessor(KafkaStreamsConfiguration configuration) {
        super(configuration);
    }

    @Override
    Topology buildTopology() {
        KStream<String, String> kStream = streamsBuilder.stream("normalTopic", Consumed.with(Serdes.String(), Serdes.String()));
        // .peek((key, value) -> log.info("message received by stream 0"));
        kStream.process(() -> new AbstractProcessor<String, String>() {
            // Records are buffered in the queue and flushed either when the queue is full
            // or when the wall-clock punctuator fires.
            final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
            final List<String> collection = new ArrayList<>();

            @Override
            public void init(ProcessorContext context) {
                super.init(context);
                // Flush whatever has accumulated at least once a minute.
                context.schedule(Duration.of(1, ChronoUnit.MINUTES), WALL_CLOCK_TIME, timestamp -> {
                    processQueue();
                    context().commit();
                });
            }

            @Override
            public void process(String key, String value) {
                queue.add(value);
                if (queue.remainingCapacity() == 0) {
                    processQueue();
                }
            }

            public void processQueue() {
                queue.drainTo(collection);
                if (!collection.isEmpty()) {
                    log.info("count is {}", collection.size());
                    // The batched external API call on `collection` would go here.
                    collection.clear();
                }
            }
        });
        // Note: process() is terminal, so this writes the original (untransformed) records
        // to the output topic; the processor above only drains and logs the batches.
        kStream.to("normalTopic1");
        return streamsBuilder.build();
    }

}
Rajesh Rai
  • Worth noting that commits occur even if the processor never calls `context.commit()`, so this implementation may commit offsets for records that have never been processed by `processQueue()`. See: https://stackoverflow.com/q/54075610/1011662 – matrix10657 Oct 23 '20 at 16:17
0

I doubt that Kafka Streams supports fixed-size (count-based) windows like other tools at the moment. But there are time-based windows supported by Kafka Streams: https://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing

Instead of a number of records, you can define the window size in terms of time.

  1. Tumbling time windows
  2. Sliding time windows
  3. Session windows
  4. Hopping time windows

In your case, a tumbling time window could be an option to use. Tumbling windows are non-overlapping, fixed-size time windows.

For example, tumbling windows with a size of 5000ms have predictable window boundaries [0;5000),[5000;10000),... — and not [1000;6000),[6000;11000),... or even something “random” like [1452;6452),[6452;11452),....
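
For illustration only (this is not code from the answer), a tumbling-window aggregation that collects records into per-key, per-window batches could look like the sketch below. The 5000 ms window size, the topic names, and the list serde are assumptions, and `externalBatchedApiCall` is a stand-in; note that, as the comment below points out, this is a time-based aggregation rather than the count-based map the question asks for.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.kstream.WindowedSerdes;

public class TumblingWindowBatchSketch {

    // listSerde must be supplied by the caller; Kafka does not ship a Serde for ArrayList.
    public static StreamsBuilder buildTopology(Serde<ArrayList<String>> listSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        // Collect values per key into a list for each non-overlapping 5-second window.
        KTable<Windowed<String>, ArrayList<String>> batches = builder
                .<String, String>stream("my-input-topic")
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(Duration.ofMillis(5000))) // tumbling: size == advance
                .aggregate(
                        ArrayList::new,
                        (key, value, agg) -> { agg.add(value); return agg; },
                        Materialized.with(Serdes.String(), listSerde));

        // Downstream, each value is the whole list for one key and one window,
        // so a batched call can be applied with a plain mapValues.
        batches.toStream()
                .mapValues(TumblingWindowBatchSketch::externalBatchedApiCall)
                .to("my-output-topic", Produced.with(
                        WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.String()));

        return builder;
    }

    // Stand-in for the real batched API call.
    static String externalBatchedApiCall(List<String> batch) {
        return String.valueOf(batch.size());
    }
}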

Nishu Tayal
  • Seems like we are on the right track with this approach, but windowing seems to be relevant only to joins and aggregations. We want to do a map. How would one do a map operation via an aggregation? – samthebest Sep 17 '18 at 14:53