
This question is a follow-up to Kafka Streams with lookup data on HDFS. I need to join small dictionary data to the main Kafka stream (like a "map-side" join).

AFAIK, a Kafka Streams instance always works on a given partition of a topic. If I want to do lookups, I need to repartition both streams on the join key to bring the related records together.

What is the cost of repartitioning back and forth several times if multiple lookup datasets need to be checked? Wouldn't it be possible to send the whole lookup dataset to each partition, so that when I build a KTable from the lookup topic, I'd see the complete dataset in every Kafka Streams application instance? Then I could do the lookup in the KStream#transform() method against the local RocksDB store that holds all the lookup data.

I'm wondering which option would be more appropriate:

  • insert the same data (the whole dataset) into each partition of a topic and do the lookups in KStream#transform (see the sketch after this list). When the topic is overpartitioned, we'll have a lot of duplicate data, but for a small dataset this shouldn't be a problem.

  • repartition both streams using the DSL API to be able to perform the lookups (joins). What are the implications here in terms of performance?
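To make the first option concrete, here is a minimal sketch of such a lookup in KStream#transform(), written against the current Streams DSL. The topic name, the "lookup-store" store name, and the String key/value types are assumptions for illustration, and the store is assumed to be registered with the topology and populated with the full broadcast lookup dataset:

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class LookupTransformExample {

    public static KStream<String, String> enrich(StreamsBuilder builder) {
        KStream<String, String> mainStream = builder.stream("main-topic");

        // "lookup-store" is assumed to be a local RocksDB-backed store that
        // already holds the complete (broadcast) lookup dataset.
        return mainStream.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
            private KeyValueStore<String, String> store;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, String>) context.getStateStore("lookup-store");
            }

            @Override
            public KeyValue<String, String> transform(String key, String value) {
                // No repartitioning needed: every instance has the full lookup data locally.
                String dictionaryEntry = store.get(key);
                return KeyValue.pair(key, dictionaryEntry == null ? value : value + "|" + dictionaryEntry);
            }

            @Override
            public void close() {}
        }, "lookup-store");
    }
}
```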

Bruckwald

1 Answer


AFAIK, a Kafka Streams instance always works on a given partition of a topic. If I want to do lookups, I need to repartition both streams on the join key to bring the related records together.

Yes, as of Apache Kafka 0.10.0 and 0.10.1, this is what you need to do.
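For illustration, a minimal sketch of such a repartition-then-join, written against the current Streams DSL (the 0.10.x releases used KStreamBuilder instead of StreamsBuilder; topic names and the value format are assumptions). Re-keying the stream with selectKey() makes Kafka Streams insert an internal repartition topic before the join, so both sides end up co-partitioned on the join key:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class RepartitionJoinExample {

    public static KStream<String, String> joinOnLookupKey(StreamsBuilder builder) {
        KStream<String, String> mainStream = builder.stream("main-topic");
        KTable<String, String> lookupTable = builder.table("lookup-topic");

        return mainStream
            // Re-key on the join key; here we assume the value is "lookupKey|payload".
            .selectKey((key, value) -> value.split("\\|", 2)[0])
            // The key change marks the stream for repartitioning: Kafka Streams
            // writes it to an internal repartition topic and reads it back, so
            // both join inputs are partitioned identically on the join key.
            .join(lookupTable, (payload, dictionaryEntry) -> payload + " -> " + dictionaryEntry);
    }
}
```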

What is the cost of repartitioning back and forth several times if multiple lookup datasets need to be checked? Wouldn't it be possible to send the whole lookup dataset to each partition, so that when I build a KTable from the lookup topic, I'd see the complete dataset in every Kafka Streams application instance?

Such functionality -- we often describe it as "global KTable" or "global state" -- would be useful indeed, and we're already discussing when/how we could add it.

Update Feb 28, 2017: The first round of functionality around global tables was released with Kafka 0.10.2, where you'll have the ability to perform a KStream-to-GlobalKTable join.
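For reference, a minimal sketch of the KStream-to-GlobalKTable join, written against the current Streams DSL (topic names and the key-mapping logic are assumptions for illustration). A GlobalKTable is fully replicated to every application instance, so the stream side needs no repartitioning:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

public class GlobalTableJoinExample {

    public static KStream<String, String> enrich(StreamsBuilder builder) {
        // Every application instance materializes the full lookup topic locally.
        GlobalKTable<String, String> lookup = builder.globalTable("lookup-topic");
        KStream<String, String> mainStream = builder.stream("main-topic");

        return mainStream.join(
            lookup,
            // Map each stream record to the GlobalKTable key it joins on;
            // here we simply reuse the record key (an assumption).
            (key, value) -> key,
            (value, dictionaryEntry) -> value + "|" + dictionaryEntry);
    }
}
```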

Repartition both streams using the DSL API to be able to perform the lookups (joins). What are the implications here in terms of performance?

The implications depend primarily on the characteristics of the input data (data volume, uniform vs. skewed data distribution, etc.). Keep in mind that repartitioning means every record is written to and re-read from an internal repartition topic in Kafka, so the extra network and broker load scales with the volume of data flowing into the join.

miguno
  • Thanks Michael! I've already started digging into the Kafka Streams internals to see how I could implement such a global state, but any suggestions would be great. Or perhaps I should create my own processor which populates the RocksDB store according to my needs? – Bruckwald Sep 23 '16 at 09:02
  • 1
    At the moment, yes, I'd suggest to use a custom processor for that. Within it, you can use a "normal" Kafka consumer client that fully reads all the partitions of the respective Kafka topic you are interested in (for the "global state/table" idea above), and then proceed as needed for your use case. – miguno Sep 26 '16 at 08:48
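A minimal sketch of that workaround, assuming a String-keyed topic named "lookup-topic" and an in-memory map as the local state: the consumer is manually assign()ed all partitions of the topic (no consumer group, no rebalancing), so every application instance reads the complete dataset.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GlobalLookupLoader {

    private final Map<String, String> lookupData = new ConcurrentHashMap<>();
    private volatile boolean running = true;

    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No group.id and no offset commits: each instance always re-reads from the beginning.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign ALL partitions of the lookup topic, so this
            // instance sees the full dataset regardless of partitioning.
            List<TopicPartition> partitions = consumer.partitionsFor("lookup-topic").stream()
                .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (running) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (record.value() == null) {
                        lookupData.remove(record.key());   // tombstone deletes the entry
                    } else {
                        lookupData.put(record.key(), record.value());
                    }
                }
            }
        }
    }
}
```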