This is a generalization of this question.
Suppose I have multiple source streams to which the same set of predicates applies. I would like to set up branch streams such that records satisfying a given predicate, regardless of which source stream they come from, are processed by the same branch stream. As the diagram below shows, each branch stream is a generic processor that transforms incoming records.
The following code block does not work as intended, since it creates a distinct set of branch streams for each source stream.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source1 = builder.stream("x");
KStream<String, String> source2 = builder.stream("y");

Predicate<String, String>[] branchPredicates = new Predicate[forkCount];
for (int i = 0; i < forkCount; ++i) {
    int idx = i; // effectively-final copy for the lambda
    branchPredicates[i] = (key, value) ->
        key.hashCode() % forkCount == idx;
}

// One branch() call per source, flattened into a single list:
// this is exactly the problem, as each source gets its own set of branches.
List<KStream<String, String>> forkStreams = Stream.of(source1, source2)
    .flatMap(srcStream -> Arrays.stream(srcStream.branch(branchPredicates)))
    .collect(Collectors.toList());
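One subtlety in the predicates above, separate from the branching question: `String.hashCode()` can be negative, and Java's `%` preserves the dividend's sign, so `key.hashCode() % forkCount == idx` never matches for a negative hash and those records would be silently dropped by `branch()`. A minimal standalone demo (class and method names are mine), with `Math.floorMod` as the usual fix:

```java
// Sketch: shows why plain % is risky for hash-based branch selection.
public class BranchIndexDemo {

    // Mirrors the predicate's test: Java's % keeps the dividend's sign,
    // so a negative hash matches none of the indices 0..forkCount-1.
    static int unsafeIndex(int hash, int forkCount) {
        return hash % forkCount;
    }

    // Math.floorMod always returns a value in [0, forkCount).
    static int safeIndex(int hash, int forkCount) {
        return Math.floorMod(hash, forkCount);
    }

    public static void main(String[] args) {
        int negativeHash = -7; // stands in for a String hashCode that overflowed
        System.out.println(unsafeIndex(negativeHash, 3)); // prints -1
        System.out.println(safeIndex(negativeHash, 3));   // prints 2
    }
}
```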
sorry, I'm mostly a scala developer :)
In the above example, forkStreams.length == branchPredicates.length x 2, and in general it is proportional to the number of source streams. Is there a trick in Kafka Streams that allows me to keep a one-to-one relationship between the predicates and the fork streams?
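One direction I have been looking at (untested sketch, assuming Kafka Streams 1.0+ where KStream#merge is available): merge the sources into a single stream before branching, so branch() is called only once and the predicate-to-branch relationship stays one-to-one.

```java
// Sketch (untested): merge first, then branch once, so
// forkStreams.length == branchPredicates.length regardless of source count.
KStream<String, String> merged = source1.merge(source2);
KStream<String, String>[] forkStreams = merged.branch(branchPredicates);
```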
UPDATE 11/27/2018 There has been some progress, in that I can now:
- Read from all source topics using one source stream
- Connect the source stream to multiple branches
- Distribute messages evenly among the branches.
However, as the following code block demonstrates, ALL fork streams run in the same thread. What I would like to achieve is to place each fork stream in a different thread to allow better CPU utilization.
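For reference, the only knob I know of for thread count is a config fragment like the one below (application id and thread count are assumed values, not from my actual setup). Raising num.stream.threads adds StreamThreads, but my understanding is that tasks are assigned per input partition, so this alone may not spread the branches of one task across threads:

```java
Properties streamProps = new Properties();
streamProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "fork-demo");        // assumed id
streamProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
// One StreamThread per desired level of parallelism; tasks (one per
// input partition) are distributed across these threads.
streamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
```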
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(Arrays.asList("a", "b", "c"));

// Create workers
// Need to have predicates for the branches
int totalPredicates = Integer
    .parseInt(props.getProperty(WORKER_PROCESSOR_COUNT));
Predicate<String, String>[] predicates = new Predicate[totalPredicates];
IntStream
    .range(0, totalPredicates)
    .forEach(i -> {
        predicates[i] = (key, value) ->
            key.hashCode() % totalPredicates == i;
    });
List<KStream<String, String>> forkStreams = Arrays.asList(source.branch(predicates));

// Hack - dump the number of messages processed every 10 seconds
forkStreams
    .forEach(fork -> {
        KStream<Windowed<String>, Long> tbl =
            fork.transformValues(new SourceTopicValueTransformerSupplier())
                .selectKey((key, value) -> "foobar")
                .groupByKey()
                .windowedBy(TimeWindows.of(2000L))
                .count()
                .toStream();
        tbl.foreach((key, count) -> {
            String fromTo = String.format("%d-%d",
                key.window().start(),
                key.window().end());
            System.out.printf("(Thread %d, Index %d) %s - %s: %d\n",
                Thread.currentThread().getId(),
                forkStreams.indexOf(fork),
                fromTo, key.key(), count);
        });
    });
Here's a snippet of the output:
<snip>
(Thread 13, Index 1) 1542132126000-1542132128000 - foobar: 2870
(Thread 13, Index 1) 1542132024000-1542132026000 - foobar: 2955
(Thread 13, Index 1) 1542132106000-1542132108000 - foobar: 1914
(Thread 13, Index 1) 1542132054000-1542132056000 - foobar: 546
<snip>
(Thread 13, Index 2) 1542132070000-1542132072000 - foobar: 524
(Thread 13, Index 2) 1542132012000-1542132014000 - foobar: 2491
(Thread 13, Index 2) 1542132042000-1542132044000 - foobar: 261
(Thread 13, Index 2) 1542132022000-1542132024000 - foobar: 2823
<snip>
(Thread 13, Index 3) 1542132088000-1542132090000 - foobar: 2170
(Thread 13, Index 3) 1542132010000-1542132012000 - foobar: 2962
(Thread 13, Index 3) 1542132008000-1542132010000 - foobar: 2847
(Thread 13, Index 3) 1542132022000-1542132024000 - foobar: 2797
<snip>
(Thread 13, Index 4) 1542132046000-1542132048000 - foobar: 2846
(Thread 13, Index 4) 1542132096000-1542132098000 - foobar: 3216
(Thread 13, Index 4) 1542132108000-1542132110000 - foobar: 2696
(Thread 13, Index 4) 1542132010000-1542132012000 - foobar: 2881
<snip>
Any suggestions on how to place each fork stream in a different thread would be appreciated.