10

For many situations in Big Data it is preferable to work with a small buffer of records at a go, rather than one record at a time.

The natural example is calling some external API that supports batching for efficiency.

How can we do this in Kafka Streams? I cannot find anything in the API that looks like what I want.

So far I have:

builder.stream[String, String]("my-input-topic")
  .mapValues(externalApiCall)
  .to("my-output-topic")

What I want is:

builder.stream[String, String]("my-input-topic")
  .batched(chunkSize = 2000)
  .map(externalBatchedApiCall)
  .to("my-output-topic")

In Scala and Akka Streams the function is called `grouped` or `batch`. In Spark Structured Streaming we can do something like `mapPartitions(_.grouped(2000).map(externalBatchedApiCall))`.

samthebest
  • Why not just schedule the processing which will then read up to chunkSize records from the stream? – daniu Sep 17 '18 at 12:01
  • A side note to your question: calling external APIs from a streams processor is not always the best pattern. Sometimes you'll find that the external data is best brought into Kafka itself (e.g. CDC from databases, mainframes, etc.) as its own topic, and then easily joined within the stream processing itself. – Robin Moffatt Sep 17 '18 at 12:11
  • mapPartitions in Spark doesn't guarantee partition size. Only the streaming duration can affect the window size – OneCricketeer Sep 17 '18 at 13:02
  • As @RobinMoffatt mentioned, it might be better to load the external data into a Kafka topic, read it as a KTable into your application, and do a stream-table join instead of an external API call. – Matthias J. Sax Sep 17 '18 at 16:23
  • Besides this, you could use `transform()` with an attached state store and build up the batches manually: if the state size is smaller than 200, put the record into the store; if you hit 200 records, extract all the data, do the external API call (note, you need to do it synchronously), and clear the store. (A sketch of this approach follows these comments.) – Matthias J. Sax Sep 17 '18 at 16:25
  • @MatthiasJ.Sax In my case, I have a state store like this: `KeyValueStore>`. On every punctuation I check the size of these lists, and any list larger than the threshold gets sent to the 3rd-party API. The question is: how do you clear the store securely while you have incoming data? – Alper Kanat Dec 18 '18 at 09:36
  • "How do you clear the store securely while you have incoming data?" -- not sure what you mean here? Can you elaborate? – Matthias J. Sax Dec 18 '18 at 10:57
  • My app reads from and writes to a state store (`KeyValueStore>`) and uses a WALL_CLOCK_TIME punctuator. Let's say every 10 secs I check the store for lists larger than the threshold and send them to a remote API. Then I reset the corresponding list and commit the state. While I do this, new messages keep coming in and the `process` method continues to write to the state store, maybe into the very list I'm working on. I guess this can't happen if I have a single thread; otherwise I'll have to add a locking mechanism, is that correct? – Alper Kanat Dec 18 '18 at 13:04
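
As a rough illustration of the `transform()`-with-state approach Matthias describes above, a minimal sketch in Java (not code from this thread) could look like the following. The store name, the use of `approximateNumEntries()` as the size check, and the shape of `externalBatchedApiCall` are assumptions for illustration; a real implementation would also schedule a punctuator to flush partial batches and deal with the commit and threading caveats discussed in the comments.

import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class BatchTransformSketch {

    static final String STORE_NAME = "batch-store"; // hypothetical store name
    static final int BATCH_SIZE = 2000;

    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("my-input-topic")
            .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                private ProcessorContext context;
                private KeyValueStore<String, String> store;

                @SuppressWarnings("unchecked")
                @Override
                public void init(ProcessorContext context) {
                    this.context = context;
                    this.store = (KeyValueStore<String, String>) context.getStateStore(STORE_NAME);
                    // A production version would also context.schedule() a punctuator here
                    // to flush batches that never reach BATCH_SIZE.
                }

                @Override
                public KeyValue<String, String> transform(String key, String value) {
                    store.put(key, value); // buffer the record (assumes keys are unique within a batch)
                    if (store.approximateNumEntries() < BATCH_SIZE) {
                        return null; // nothing to emit until the batch is full
                    }
                    // Drain the store, call the external API synchronously, forward the results.
                    List<KeyValue<String, String>> batch = new ArrayList<>();
                    try (KeyValueIterator<String, String> it = store.all()) {
                        it.forEachRemaining(batch::add);
                    }
                    for (KeyValue<String, String> result : externalBatchedApiCall(batch)) {
                        context.forward(result.key, result.value);
                    }
                    batch.forEach(kv -> store.delete(kv.key));
                    return null;
                }

                @Override
                public void close() { }
            }, STORE_NAME)
            .to("my-output-topic");

        return builder;
    }

    // Stand-in for the real batched API call.
    static List<KeyValue<String, String>> externalBatchedApiCall(List<KeyValue<String, String>> batch) {
        return batch; // identity; a real implementation would call the remote service
    }
}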

3 Answers

6

Doesn't seem to exist yet. Watch this space https://issues.apache.org/jira/browse/KAFKA-7432

samthebest
3

You could use a queue. Something like the below:

// AbstractStreamProcessor is the answerer's own base class (assumed to provide the
// `streamsBuilder` field); KafkaStreamsConfiguration is assumed to be spring-kafka's
// org.springframework.kafka.config.KafkaStreamsConfiguration.
import static org.apache.kafka.streams.processor.PunctuationType.WALL_CLOCK_TIME;

import java.time.Duration;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;

import org.springframework.kafka.config.KafkaStreamsConfiguration;
import org.springframework.stereotype.Component;

import lombok.extern.slf4j.Slf4j;

@Component
@Slf4j
public class NormalTopic1StreamProcessor extends AbstractStreamProcessor<String> {

    public NormalTopic1StreamProcessor(KafkaStreamsConfiguration configuration) {
        super(configuration);
    }

    @Override
    Topology buildTopology() {
        KStream<String, String> kStream = streamsBuilder.stream("normalTopic", Consumed.with(Serdes.String(), Serdes.String()));
        // .peek((key, value) -> log.info("message received by stream 0"));
        kStream.process(() -> new AbstractProcessor<String, String>() {
            // Records are buffered in the queue and flushed either when the queue is full
            // or when the wall-clock punctuator fires.
            final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
            final List<String> collection = new ArrayList<>();

            @Override
            public void init(ProcessorContext context) {
                super.init(context);
                // Flush whatever has accumulated at least once a minute.
                context.schedule(Duration.of(1, ChronoUnit.MINUTES), WALL_CLOCK_TIME, timestamp -> {
                    processQueue();
                    context().commit();
                });
            }

            @Override
            public void process(String key, String value) {
                queue.add(value);
                if (queue.remainingCapacity() == 0) {
                    processQueue();
                }
            }

            public void processQueue() {
                queue.drainTo(collection);
                if (!collection.isEmpty()) {
                    log.info("count is {}", collection.size());
                    // The batched external API call on `collection` would go here.
                    collection.clear();
                }
            }
        });
        // Note: process() is terminal, so this writes the original (untransformed) records
        // to the output topic; the processor above only drains and logs the batches.
        kStream.to("normalTopic1");
        return streamsBuilder.build();
    }

}
Rajesh Rai
  • Worth noting that commits occur even if the processor never calls `context.commit()`, so this implementation may commit offsets for records that have never been processed by `processQueue()`. See: https://stackoverflow.com/q/54075610/1011662 – matrix10657 Oct 23 '20 at 16:17
0

I doubt that Kafka Streams supports fixed-size (count-based) windows like other tools at the moment. But there are time-based windows supported by Kafka Streams: https://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing

Instead of a number of records, you can define the window size in terms of time.

  1. Tumbling time windows
  2. Sliding time windows
  3. Session windows
  4. Hopping time windows

In your case, a tumbling time window could be an option to use. Tumbling windows are non-overlapping, fixed-size time windows.

For example, tumbling windows with a size of 5000ms have predictable window boundaries [0;5000),[5000;10000),... — and not [1000;6000),[6000;11000),... or even something “random” like [1452;6452),[6452;11452),....
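
For illustration only (this is not code from the answer), a tumbling-window aggregation that collects records into per-key, per-window batches could look like the sketch below. The 5000 ms window size, the topic names, and the list serde are assumptions, and `externalBatchedApiCall` is a stand-in; note that, as the comment below points out, this is a time-based aggregation rather than the count-based map the question asks for.

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.kstream.WindowedSerdes;

public class TumblingWindowBatchSketch {

    // listSerde must be supplied by the caller; Kafka does not ship a Serde for ArrayList.
    public static StreamsBuilder buildTopology(Serde<ArrayList<String>> listSerde) {
        StreamsBuilder builder = new StreamsBuilder();

        // Collect values per key into a list for each non-overlapping 5-second window.
        KTable<Windowed<String>, ArrayList<String>> batches = builder
                .<String, String>stream("my-input-topic")
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.of(Duration.ofMillis(5000))) // tumbling: size == advance
                .aggregate(
                        ArrayList::new,
                        (key, value, agg) -> { agg.add(value); return agg; },
                        Materialized.with(Serdes.String(), listSerde));

        // Downstream, each value is the whole list for one key and one window,
        // so a batched call can be applied with a plain mapValues.
        batches.toStream()
                .mapValues(TumblingWindowBatchSketch::externalBatchedApiCall)
                .to("my-output-topic", Produced.with(
                        WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.String()));

        return builder;
    }

    // Stand-in for the real batched API call.
    static String externalBatchedApiCall(List<String> batch) {
        return String.valueOf(batch.size());
    }
}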

Nishu Tayal
  • Seems like we are on the right track with this approach, but windowing seems to be relevant only to joins and aggregations. We want to do a map. How would one do a map operation via an aggregation? – samthebest Sep 17 '18 at 14:53