I'd like to outer join several (typically 2-10) Kafka topics by key, ideally using the streaming API. All topics will have the same key and partitions. One way to do this join is to create a KStream for each topic and chain calls to KStream.outerJoin:

stream1
    .outerJoin(stream2, ...)
    .outerJoin(stream3, ...)
    .outerJoin(stream4, ...)

However, the documentation of KStream.outerJoin suggests that each call to outerJoin will materialize its two input streams. The above example would therefore materialize not just streams 1 to 4 but also the intermediate results stream1.outerJoin(stream2, ...) and stream1.outerJoin(stream2, ...).outerJoin(stream3, ...). That means a lot of unnecessary serialization, deserialization, and I/O compared to joining the 4 streams directly.

Another problem with the above approach is that the JoinWindow would not be applied consistently across all 4 input streams: one JoinWindow would be used to join streams 1 and 2, then a separate JoinWindow would be used to join that result with stream 3, and so on. For example, if I specify a join window of 10 seconds for each join and entries with a certain key appear in stream 1 at 0 seconds, stream 2 at 6 seconds, stream 3 at 12 seconds, and stream 4 at 18 seconds, the joined item would still be output, but only after 18 seconds, well beyond the intended 10-second window. The results also depend on the order of the joins, which seems unnatural.
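To make the window arithmetic concrete, here is a minimal plain-Java sketch (no Kafka dependency) that contrasts chained pairwise windows with a single window over all inputs. The class and method names are hypothetical, and the sketch assumes that each intermediate join result carries the larger of its two input timestamps; that assumption is what lets the chain keep matching.

```java
import java.util.List;

/** Hypothetical helper; not part of Kafka Streams. */
final class JoinWindowSketch {

    /**
     * Simulates chained pairwise joins: each arrival is compared only with
     * the timestamp of the previous join result.
     * ASSUMPTION: a join result carries the max of its two input timestamps.
     */
    static boolean chainedJoinMatches(final List<Long> arrivals, final long window) {
        long joinedTs = arrivals.get(0);
        for (final long ts : arrivals.subList(1, arrivals.size())) {
            if (Math.abs(ts - joinedTs) > window)
                return false; // this pairwise join would not match
            joinedTs = Math.max(joinedTs, ts);
        }
        return true;
    }

    /** A single window over all inputs: earliest and latest arrival must be within the window. */
    static boolean singleWindowMatches(final List<Long> arrivals, final long window) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (final long ts : arrivals) {
            min = Math.min(min, ts);
            max = Math.max(max, ts);
        }
        return max - min <= window;
    }
}
```

With arrivals at 0, 6, 12, and 18 seconds and a 10-second window, chainedJoinMatches returns true because every step is only 6 seconds apart, while singleWindowMatches returns false because the overall spread is 18 seconds.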

Is there a better approach to multi-way joins using Kafka?

Reinstate Monica

3 Answers

I don't know of a better approach in Kafka Streams currently, but one is in the making:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-150+-+Kafka-Streams+Cogroup

Michal Borowiecki

Ultimately I decided to create a custom lightweight joiner that avoids materialization and strictly honors the expiration time. It should be O(1) on average. It fits better with the Consumer API than with the Stream API: for each consumer, repeatedly poll and update the joiner with any received data; if the joiner returns a complete attribute set, forward it on. Here's the code:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/**
 * Inner joins multiple streams of data by key into one stream. It is assumed
 * that a key will appear in a stream exactly once. The values associated with
 * each key are collected and if all values are received within a certain
 * maximum wait time, the joiner returns all values corresponding to that key.
 * If not all values are received in time, the joiner never returns any values
 * corresponding to that key.
 * <p>
 * This class is not thread safe: all calls to
 * {@link #update(Object, Object, long)} must be synchronized.
 * @param <K> The type of key.
 * @param <V> The type of value.
 */
class StreamInnerJoiner<K, V> {

    private final Map<K, Vals<V>> idToVals = new LinkedHashMap<>();
    private final int joinCount;
    private final long maxWait;

    /**
     * Creates a stream inner joiner.
     * @param joinCount The number of streams being joined.
     * @param maxWait The maximum amount of time after an item has been seen in
     * one stream to wait for it to be seen in the remaining streams.
     */
    StreamInnerJoiner(final int joinCount, final long maxWait) {
        this.joinCount = joinCount;
        this.maxWait = maxWait;
    }

    private static class Vals<A> {
        final long firstSeen;
        final Collection<A> vals = new ArrayList<>();
        private Vals(final long firstSeen) {
            this.firstSeen = firstSeen;
        }
    }

    /**
     * Updates this joiner with a value corresponding to a key.
     * @param key The key.
     * @param val The value.
     * @param now The current time.
     * @return If all values for the specified key have been received, the
     * complete collection of values for that key; otherwise
     * {@link Optional#empty()}.
     */
    Optional<Collection<V>> update(final K key, final V val, final long now) {
        expireOld(now - maxWait);
        final Vals<V> curVals = getOrCreate(key, now);
        curVals.vals.add(val);
        return expireAndGetIffFull(key, curVals);
    }

    private Vals<V> getOrCreate(final K key, final long now) {
        final Vals<V> existingVals = idToVals.get(key);
        if (existingVals != null)
            return existingVals;
        else {
            /*
            Note: we assume that the item with the specified ID has not already
            been seen and timed out, and therefore that its first seen time is
            now. If the item has in fact already timed out, it is doomed and
            will time out again with no ill effect.
             */
            final Vals<V> curVals = new Vals<>(now);
            idToVals.put(key, curVals);
            return curVals;
        }
    }

    private void expireOld(final long expireBefore) {
        final Iterator<Vals<V>> i = idToVals.values().iterator();
        while (i.hasNext() && i.next().firstSeen < expireBefore)
            i.remove();
    }

    private Optional<Collection<V>> expireAndGetIffFull(final K key, final Vals<V> vals) {
        if (vals.vals.size() == joinCount) {
            // as all expired entries were already removed, this entry is valid
            idToVals.remove(key);
            return Optional.of(vals.vals);
        } else
            return Optional.empty();
    }
}
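
The poll-update-forward loop described above could be driven roughly as follows. This is only a sketch: JoinerLoopSketch, Update, Event, and forwardComplete are hypothetical names, and the list of events stands in for whatever the consumers actually return from polling. The Update interface simply mirrors the signature of update so that a joiner can be passed as a method reference.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

/** Hypothetical driver for the poll-update-forward loop; all names are illustrative. */
final class JoinerLoopSketch {

    /** Mirrors the shape of the joiner's update method. */
    interface Update<K, V> {
        Optional<Collection<V>> update(K key, V val, long now);
    }

    /** One polled record: key, value, and the time it was received. */
    static final class Event<K, V> {
        final K key;
        final V val;
        final long receivedAt;

        Event(final K key, final V val, final long receivedAt) {
            this.key = key;
            this.val = val;
            this.receivedAt = receivedAt;
        }
    }

    /** Feeds each event to the joiner and collects every complete value set it returns. */
    static <K, V> List<Collection<V>> forwardComplete(
            final Iterable<Event<K, V>> polled, final Update<K, V> joiner) {
        final List<Collection<V>> out = new ArrayList<>();
        for (final Event<K, V> e : polled)
            joiner.update(e.key, e.val, e.receivedAt).ifPresent(out::add);
        return out;
    }
}
```

Given a StreamInnerJoiner<String, String> named joiner in the same package, the call would look like JoinerLoopSketch.forwardComplete(events, joiner::update). Because the joiner is not thread safe, all events should go through it from a single thread or under a lock.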
Reinstate Monica

If you merge all your streams you will get what you want. Look at this tutorial on how to do it.

Input streams are combined using the merge function, which creates a new stream that represents all of the events of its inputs.

Hrvoje
    This works if all streams have the same value types - but this is a specific case of outer join. In the more general case, outer joins could have different value types. – Kkkev Feb 23 '23 at 18:00