
I'm writing a Kafka Streams app in which I'm producing statistics for web pages. I have a stream of information about web pages which includes the page type (news, gaming, blog, etc.) and the page language (en, fr, ru, etc.) in a struct.

I've filtered this stream into a second stream which includes all languages for a specific page type. For this example, we can assume that the filtered stream includes all events for "news" pages.

I would now like to output to a topic the value of the number of pages per language divided by the total number of pages of the same type.

I used .count() to create a KTable which counts the events per language. I also used .count() to create a KTable which counts all events of the same type.

In order to produce the division, I was planning to use a join between the two tables which would take the left value and divide it by the right value. Unfortunately, this doesn't seem to work, as the left table's keys are the languages while the right table's key is the page type.

My code is as follows:

ValueJoiner<Long, Long, Float> valueJoiner = (leftVal, rightVal) -> {
    if ((rightVal != null) && (leftVal != null)) {
        return leftVal.floatValue() / rightVal;
    }
    return 0f;
};

// the per language table for news pages
KTable<String, Long> langTable = newsStream.selectKey((ignored, value) -> value.getLang()).groupByKey().count();
// the table which counts all events of news pages
KTable<String, Long> allTable = newsStream.groupBy((ignored, value) -> value.getType()).count();

// this is the join that doesn't produce values (as there are no common keys?)
KTable<String, Float> joinedLangs = langTable.join(allTable, valueJoiner);

What would be the best way to make this code work and produce the relative amounts?

meirgold

1 Answer


If we are talking about a join, then the input data on both sides (left and right) must be co-partitioned. Refer to https://developer.confluent.io/tutorials/foreign-key-joins/kstreams.html

Join output records are effectively created as follows, leveraging the user-supplied ValueJoiner:

KeyValue<K, LV> leftRecord = ...;
KeyValue<K, RV> rightRecord = ...;
ValueJoiner<LV, RV, JV> joiner = ...;

KeyValue<K, JV> joinOutputRecord = KeyValue.pair(
    leftRecord.key, /* by definition, leftRecord.key == rightRecord.key */
    joiner.apply(leftRecord.value, rightRecord.value)
  );
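In the question's code that condition can never hold: langTable is keyed by language while allTable is keyed by page type, so the primary-key join emits nothing. The foreign-key join covered by the linked tutorial (KTable#join with a foreign-key extractor, available since Kafka Streams 2.4) lets the left table derive the right table's key instead. A minimal sketch, assuming the usual org.apache.kafka.streams imports and hard-coding the "news" type from the question's example:

// Foreign-key join: every per-language row points at the single "news" row of allTable.
// The extractor receives the left value (the per-language count) and returns the right key.
KTable<String, Float> joinedLangs = langTable.join(
        allTable,
        langCount -> "news",                 // foreign key = page type used by the right-hand table
        (langCount, totalCount) ->
                (totalCount == null || totalCount == 0L)
                        ? 0f
                        : langCount.floatValue() / totalCount);

The result stays keyed by language; if the left table's value carried the page type instead of a bare count, the extractor would read it from the value rather than using a literal.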

Join co-partitioning requirements:

  • Input data must be co-partitioned when joining.
  • This ensures that input records with the same key, from both sides of the join, are delivered to the same stream task during processing.
  • It is the responsibility of the user to ensure data co-partitioning when joining. Consider using global tables (GlobalKTable) for joining because they do not require data co-partitioning.
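To illustrate the GlobalKTable option above: global tables can only be joined against a KStream, so the per-language table would be joined through its changelog stream, and the totals would first be written to an intermediate topic. A minimal sketch, assuming a StreamsBuilder named builder and a made-up topic name "news-totals":

// Publish the running totals and read them back as a global table (no co-partitioning needed).
allTable.toStream().to("news-totals", Produced.with(Serdes.String(), Serdes.Long()));
GlobalKTable<String, Long> totals =
        builder.globalTable("news-totals", Consumed.with(Serdes.String(), Serdes.Long()));

// Join the per-language changelog stream against the global totals, looking up by page type.
KStream<String, Float> ratios = langTable
        .toStream()                                   // key = language, value = per-language count
        .join(totals,
              (lang, count) -> "news",                // lookup key into the global table
              (count, total) -> (total == null || total == 0L)
                      ? 0f
                      : count.floatValue() / total);

Note that the lookup uses whatever total the global store holds at that moment, so the emitted ratios are only eventually consistent.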

The requirements for data co-partitioning are:

  • The input topics of the join (left side and right side) must have the same number of partitions.
  • All applications that write to the input topics must have the same partitioning strategy so that records with the same key are delivered to the same partition number.
  • In other words, the keyspace of the input data must be distributed across partitions in the same manner. Applications that use Kafka’s Java Producer API must use the same partitioner (cf. the producer setting "partitioner.class" aka ProducerConfig.PARTITIONER_CLASS_CONFIG), and applications that use Kafka’s Streams API must use the same StreamPartitioner for operations such as KStream#to().
  • If you happen to use the default partitioner-related settings across all applications, you do not need to worry about the partitioning strategy.
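As a sketch of meeting the partition-count requirement, Kafka Streams 2.6+ lets each side be passed through an explicit repartition topic with a matching partition count before the join. leftStream, rightStream and the count of 6 below are placeholder names and values, not from the question:

// Force both sides through repartition topics with the same number of partitions
// and the default partitioning strategy, so equal keys land in equal partition numbers.
KStream<String, Long> leftCopartitioned = leftStream.repartition(
        Repartitioned.<String, Long>with(Serdes.String(), Serdes.Long())
                     .withNumberOfPartitions(6));
KStream<String, Long> rightCopartitioned = rightStream.repartition(
        Repartitioned.<String, Long>with(Serdes.String(), Serdes.Long())
                     .withNumberOfPartitions(6));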
ChristDist