
This is a generalization of this question.

Suppose I have multiple source streams to which the same set of predicates applies. I would like to set up branch streams such that records satisfying a given predicate are processed by the same branch stream, regardless of which source stream they came from. As the diagram below shows, each branch stream is a generic processor that transforms incoming records.

[diagram: multiple source streams fanning out into a shared set of branch streams]

The following code block does not work as it should, since it creates a distinct set of branch streams for each source stream.

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source1 = builder.stream("x");
KStream<String, String> source2 = builder.stream("y");

Predicate<String, String>[] branchPredicates = new Predicate[forkCount];
for (int i = 0; i < forkCount; ++i) {
    int idx = i;
    branchPredicates[i] = ((key, value) ->
        key.hashCode() % forkCount == idx);
}

KStream<String, String>[] forkStreams = Stream.of(source1, source2)
    .map(srcStream -> srcStream.branch(branchPredicates))
    .flatMap(Arrays::stream)
    .toArray(KStream[]::new);

sorry, I'm mostly a scala developer :)

In the above example, `forkStreams.length == branchPredicates.length * 2`, and in general the number of fork streams is proportional to the number of source streams. Is there a trick in Kafka Streams that allows me to keep a one-to-one relationship between the predicates and the fork streams?
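One aside on the predicate itself: `key.hashCode()` can be negative, and in Java `%` then yields a negative remainder, so such a record would match none of the branch predicates and be silently dropped. A small Kafka-independent sketch of the safer variant (`branchIndex` is a name I made up for illustration):

```java
public class BranchIndex {
    // Math.floorMod always yields a value in [0, forkCount), whereas the %
    // operator returns a negative remainder when hashCode() is negative,
    // in which case none of the branch predicates would match.
    static int branchIndex(String key, int forkCount) {
        return Math.floorMod(key.hashCode(), forkCount);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 4);               // -3
        System.out.println(Math.floorMod(-7, 4)); //  1
    }
}
```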

UPDATE 11/27/2018: There has been some progress, in that I can:

  • Read from all source topics using one source stream
  • Connect the source stream to multiple branches
  • Distribute messages evenly among the branches.

However, as the following code block demonstrates, ALL fork streams run in the same thread. What I would like to achieve is to place each fork stream in a different thread to allow better CPU utilization.

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(Arrays.asList("a", "b", "c"));

// Create workers
// Need to have predicates for the branches
int totalPredicates = Integer
    .parseInt(props.getProperty(WORKER_PROCESSOR_COUNT));
Predicate<String, String>[] predicates = new Predicate[totalPredicates];
IntStream
    .range(0, totalPredicates)
    .forEach(i ->
        predicates[i] = (key, value) ->
            key.hashCode() % totalPredicates == i);

List<KStream<String, String>> forkStreams = Arrays.asList(source.branch(predicates));

// Hack - dump the number of messages processed every few seconds
forkStreams
    .forEach(fork -> {
        KStream<Windowed<String>, Long> tbl =
            fork.transformValues(new SourceTopicValueTransformerSupplier())
                .selectKey((key, value) -> "foobar")
                .groupByKey()
                .windowedBy(TimeWindows.of(2000L))
                .count()
                .toStream();

        tbl.foreach((key, count) -> {
            String fromTo = String.format("%d-%d",
                                          key.window().start(),
                                          key.window().end());
            System.out.printf("(Thread %d, Index %d) %s - %s: %d\n",
                              Thread.currentThread().getId(),
                              forkStreams.indexOf(fork),
                              fromTo, key.key(), count);
        });
    });

Here's a snippet of the output:

<snip>
(Thread 13, Index 1) 1542132126000-1542132128000 - foobar: 2870
(Thread 13, Index 1) 1542132024000-1542132026000 - foobar: 2955
(Thread 13, Index 1) 1542132106000-1542132108000 - foobar: 1914
(Thread 13, Index 1) 1542132054000-1542132056000 - foobar: 546
<snip>
(Thread 13, Index 2) 1542132070000-1542132072000 - foobar: 524
(Thread 13, Index 2) 1542132012000-1542132014000 - foobar: 2491
(Thread 13, Index 2) 1542132042000-1542132044000 - foobar: 261
(Thread 13, Index 2) 1542132022000-1542132024000 - foobar: 2823
<snip>
(Thread 13, Index 3) 1542132088000-1542132090000 - foobar: 2170
(Thread 13, Index 3) 1542132010000-1542132012000 - foobar: 2962
(Thread 13, Index 3) 1542132008000-1542132010000 - foobar: 2847
(Thread 13, Index 3) 1542132022000-1542132024000 - foobar: 2797
<snip>
(Thread 13, Index 4) 1542132046000-1542132048000 - foobar: 2846
(Thread 13, Index 4) 1542132096000-1542132098000 - foobar: 3216
(Thread 13, Index 4) 1542132108000-1542132110000 - foobar: 2696
(Thread 13, Index 4) 1542132010000-1542132012000 - foobar: 2881
<snip>

Any suggestions as to how to place each fork stream in a different thread would be appreciated.

nads
    Why don't you read both topics at once: `StreamsBuilder.stream("x", "y");` – Matthias J. Sax Nov 24 '18 at 05:24
  • There's too much volume for one thread to read from all the topics. I'm trying to avoid pegging the CPU on the source stream. – nads Nov 24 '18 at 17:08
    Topics are scaled via partitions -- also Kafka Streams scales per-partition. Thus, if both input topics have 10 partitions, you can run up to 10 threads, and each thread would process 2 partitions. Would this work for you? – Matthias J. Sax Nov 24 '18 at 21:49
  • I am aware of the scale-by-partition concept. It's the records within a single partition for which I need more than 1 CPU to process. What I want to see is that the source stream just keeps on reading records and handing them off to the generic processors (branches). – nads Nov 24 '18 at 23:03
  • For this case, you would need to leave the topology as is, and add `to()` statements to both parallel running (not connected) parts of the topology. It is ok to write into the same topic multiple times -- of course, there won't be any ordering guarantees for this case. Kafka Streams processes partitions of two different topics with multiple threads only if they are not connected in the topology graph. – Matthias J. Sax Nov 25 '18 at 04:18
  • Got it. Thanks! Could I also use _merge()_ to connect the output of each parallel stream to _n_ sink streams which write to the same topic? – nads Nov 25 '18 at 04:29
  • You can, but if you use `merge()` you connect both parts into one and thus it will be one task instead of two and executed by only one thread. – Matthias J. Sax Nov 25 '18 at 17:13

1 Answer


The update on 11/27/2018 answered the original question. That said, the solution does not work for me, since I wanted each fork stream to run as a separate thread. Calling `stream.branch()` creates multiple sub-streams within the same thread, so all records within a partition are still processed on that one thread.

To achieve sub-partition parallelism, I ended up using the Kafka consumer client API in conjunction with Java threads and concurrent queues.
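For what it's worth, the hand-off can be sketched with nothing but the JDK. This is illustrative rather than the exact code I used: the Kafka poll loop is omitted, records are reduced to plain strings, and `ForkDispatcher` with its queue-per-worker layout is a made-up name and structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hand records off to a fixed pool of worker threads, one queue per worker,
// so that records with the same key always land on the same worker.
public class ForkDispatcher {
    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    public ForkDispatcher(int forkCount) {
        for (int i = 0; i < forkCount; ++i) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>();
            queues.add(queue);
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        process(queue.take()); // blocks until a record arrives
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // worker shutdown
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // floorMod keeps the index non-negative even for negative hash codes.
    public boolean dispatch(String key, String value) {
        int idx = Math.floorMod(key.hashCode(), queues.size());
        return queues.get(idx).offer(value); // unbounded queue: always true
    }

    // Placeholder for the per-record work done by each fork.
    protected void process(String record) {
        System.out.printf("(Thread %d) %s%n",
                          Thread.currentThread().getId(), record);
    }
}
```

In the real application, `dispatch()` is called from the `KafkaConsumer` poll loop for every record, which keeps the consumer thread free to keep reading while the workers do the heavy lifting.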

nads