
I have a use case in which I receive tweets on one topic and user details on another topic. I need to look up the username from the user details and set it on each tweet. With the following code I am able to get the expected outcome.

KStream<String, Tweet> tweetStream = builder.stream("tweet-topic",
        Consumed.with(Serdes.String(), serdeProvider.getTweetSerde()));

KTable<String, User> userTable = builder.table("user-topic",
        Consumed.with(Serdes.String(), serdeProvider.getUserSerde()));

KStream<String, Tweet> finalStream = tweetStream.leftJoin(userTable,
        (tweetDetail, userDetail) -> {
            if (userDetail != null) {
                return tweetDetail.setUserName(userDetail.getName());
            }
            return tweetDetail;
        },
        Joined.with(Serdes.String(), serdeProvider.getTweetSerde(),
                serdeProvider.getUserSerde()));

However, with 1,000 records in the KTable topic, this logic takes more than 2 hours to process 1 million tweets. Earlier it was taking 2 to 3 minutes.

Earlier, when the user details were kept in a local hash map, it used to take approximately 10 minutes to process all the data. Is there any other way to avoid the leftJoin or improve its performance?

Suchita
  • A left join is basically just a key lookup into the RocksDB store, and thus I am wondering why your processing rate is so low. Kafka Streams can easily process in the range of 10,000 records per second per thread. How long does it take to just read 1M records without doing the join? Maybe your overall configuration has issues? How many input topic partitions do you have and how many threads are you using? – Matthias J. Sax Dec 01 '19 at 03:20
  • @MatthiasJ.Sax: The left join is taking approx. 10,000,154 nanoseconds per message, i.e. about 10 milliseconds per message. Before the left join I was able to process 80 messages in 10 milliseconds (8k records per second). I am using a single thread and there are 12 partitions of the topic. – Suchita Dec 01 '19 at 17:07
  • Not sure how much overhead (de)serialization adds... -- each read access to the store needs to deserialize the Tweet. A simple workaround, as you have 12 partitions, is to run up to 12 threads -- if a single machine becomes CPU bound, you can also run on 2 machines with 6 threads each. – Matthias J. Sax Dec 01 '19 at 18:08
  • Another workaround might be to implement the join "manually" using a custom `transform()` step -- this way you can have a backing store but keep an in-memory copy of your Tweets in a HashMap in parallel (in case the deserialization overhead is the root cause of the low throughput). – Matthias J. Sax Dec 01 '19 at 18:10
  • I also tried to add a custom transformer, but was not able to share the state store between the KTable and the KStream. – Suchita Dec 01 '19 at 18:29
  • Also, I cannot maintain a local hash map; it may lead to OOM. – Suchita Dec 01 '19 at 20:48
  • `Also, I cannot maintain a local hash map; it may lead to OOM` -- for this case, the approach with a custom `transform()` does not make sense anyway. I would recommend scaling out using more threads. – Matthias J. Sax Dec 01 '19 at 22:02
  • @MatthiasJ.Sax: The leftJoin operation fails when the application auto-scales. With this approach we have to select a key, due to which all the data goes to a single pod and the other pod remains idle. We tried replacing the KTable with a GlobalKTable, but it causes latency due to org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, StoreName, is not open. Do you have any better solution to handle such cases? – Suchita May 29 '20 at 15:16
  • `all the data goes to a single pod and the other pod remains idle` -- that is rather weird. Or does your repartition topic have only one partition? If yes, that explains it, and you need to create the topic with as many partitions as you want to scale out. – Matthias J. Sax May 29 '20 at 19:04
  • The topic has 12 partitions; however, the key is the same for all records. Hence, the data is read by the same pod. – Suchita May 31 '20 at 13:25
  • Well, for this case you have a "data problem" and there is not much you can do about it... -- Btw: did you try out the Kafka Streams in-memory store instead of RocksDB? – Matthias J. Sax Jun 01 '20 at 05:35
  • Nope. We require the data to be present even after a pod failure/restart (persistent storage). – Suchita Jun 02 '20 at 09:51
  • A persistent store is still considered "ephemeral" -- the backing changelog is used for fault tolerance. Thus, a persistent store "only" speeds up recovery, but even if you use an in-memory store, it would be recovered from the changelog if a pod goes down. To reduce recovery time for in-memory stores, you can use StandbyTasks (note, there is a bug for in-memory StandbyTasks -- thus, you would need to wait for the upcoming 2.6 release; otherwise, it won't work as expected). But maybe regular changelog recovery is fast enough anyway for in-memory stores. – Matthias J. Sax Jun 02 '20 at 15:36
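
For reference, the scale-out suggested in the comments (up to 12 threads in total for 12 partitions, possibly spread over several instances) is only a StreamsConfig change. A minimal sketch, assuming the topology built in the question; the application id and bootstrap servers are placeholders:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tweet-enricher");    // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // adjust to your cluster
// With 12 input partitions, up to 12 threads in total do useful work,
// e.g. 6 threads on each of 2 instances.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 6);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();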
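
A sketch of the "manual join" workaround discussed in the comments, continuing the snippet from the question and using the pre-2.7 Processor/ValueTransformerWithKey API that matches the question's timeframe. It assumes both topics are co-partitioned on the user id; the store name "user-store", the variable names, and the optional HashMap cache are illustrative, not from the original post. The KTable is replaced by a plain stream that writes users into a shared store, and each tweet is enriched with a point lookup:

StoreBuilder<KeyValueStore<String, User>> userStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("user-store"),
                Serdes.String(),
                serdeProvider.getUserSerde());
builder.addStateStore(userStoreBuilder);

// Writer side: materialize the user topic into the shared store.
builder.stream("user-topic",
                Consumed.with(Serdes.String(), serdeProvider.getUserSerde()))
        .process(() -> new Processor<String, User>() {
            private KeyValueStore<String, User> store;

            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, User>) context.getStateStore("user-store");
            }

            @Override
            public void process(String userId, User user) {
                store.put(userId, user);
            }

            @Override
            public void close() { }
        }, "user-store");

// Reader side: enrich each tweet with a key lookup into the same store;
// the HashMap keeps already-deserialized users to avoid deserializing on
// every lookup (note: the cache will not see later updates to a user).
KStream<String, Tweet> enrichedStream = tweetStream.transformValues(
        () -> new ValueTransformerWithKey<String, Tweet, Tweet>() {
            private KeyValueStore<String, User> store;
            private final Map<String, User> cache = new HashMap<>();

            @Override
            public void init(ProcessorContext context) {
                store = (KeyValueStore<String, User>) context.getStateStore("user-store");
            }

            @Override
            public Tweet transform(String userId, Tweet tweet) {
                User user = cache.computeIfAbsent(userId, store::get);
                return user != null ? tweet.setUserName(user.getName()) : tweet;
            }

            @Override
            public void close() { }
        }, "user-store");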
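
And a sketch of the in-memory-store idea from the last comments: back the user table with the Kafka Streams in-memory store instead of RocksDB. Fault tolerance still comes from the changelog topic; a persistent local store only shortens recovery. The store name is illustrative, and the standby-replica setting goes on the same StreamsConfig properties shown above:

KTable<String, User> userTable = builder.table("user-topic",
        Consumed.with(Serdes.String(), serdeProvider.getUserSerde()),
        Materialized.<String, User>as(Stores.inMemoryKeyValueStore("user-store-inmem"))
                .withKeySerde(Serdes.String())
                .withValueSerde(serdeProvider.getUserSerde()));

// Optional: standby replicas keep a warm copy of the store on another
// instance to shorten fail-over recovery.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);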

0 Answers