This question is a follow-up to Kafka Streams with lookup data on HDFS. I need to join small dictionary data (similar to a "map-side" join) against the main Kafka stream.
AFAIK, a Kafka Streams instance always works on a given partition of a topic. If I want to do lookups, I need to repartition both streams on the join key to bring the related records together.
What is the cost of repartitioning back and forth several times if multiple lookup datasets need to be checked?
Wouldn't it be possible to send the whole lookup dataset to every partition, so that when I build a KTable from the lookup topic, I see the full data set in every Kafka Streams application instance? Then I could do the lookup in `KStream#transform()` against the local RocksDB store that holds all the lookup data.
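To make that idea concrete, here is a rough sketch of what I have in mind (topic and store names are made up, and this assumes the fully replicated lookup topic's store is locally accessible to the main stream's task; exact API details vary by Kafka Streams version):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.ValueTransformerWithKey;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.ValueAndTimestamp;

StreamsBuilder builder = new StreamsBuilder();

// Materialize the (fully replicated) lookup topic into a local RocksDB store.
builder.table("lookup-topic", Materialized.as("lookup-store"));

// Enrich the main stream by reading the lookup store directly.
builder.<String, String>stream("main-topic")
    .transformValues(() -> new ValueTransformerWithKey<String, String, String>() {
        private KeyValueStore<String, ValueAndTimestamp<String>> store;

        @Override
        public void init(ProcessorContext context) {
            store = context.getStateStore("lookup-store");
        }

        @Override
        public String transform(String key, String value) {
            // Look up the dictionary entry for this record's key.
            ValueAndTimestamp<String> dict = store.get(key);
            return dict == null ? value : value + "|" + dict.value();
        }

        @Override
        public void close() { }
    }, "lookup-store")   // connect the table's store to the transformer
    .to("enriched-topic");
```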
I'm wondering which option would be more appropriate:
1. Insert the same data (the whole data set) into each partition of a topic and do the lookups in `KStream#transform`. When the topic is over-partitioned we'll have a lot of duplicate data, but for a small dataset this shouldn't be a problem.
2. Repartition both streams using the DSL API to be able to perform the lookups (joins). What are the implications here in terms of performance?
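For reference, option 2 with the DSL might look roughly like this (`extractJoinKey` is a hypothetical helper; each `selectKey` before a table conversion or join triggers a repartition topic, which is exactly the cost I'm asking about):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

// Rekey the lookup topic on the join key; this creates a repartition topic.
KTable<String, String> dict = builder.<String, String>stream("lookup-topic")
    .selectKey((k, v) -> extractJoinKey(v))
    .toTable();   // Kafka Streams 2.5+; older versions can use groupByKey().reduce((a, b) -> b)

// Rekey the main stream as well (a second repartition topic), then join.
builder.<String, String>stream("main-topic")
    .selectKey((k, v) -> extractJoinKey(v))
    .join(dict, (event, dictValue) -> event + "|" + dictValue)
    .to("enriched-topic");
```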