11

I'm using a materialized KTable in a left join with my KStream (the stream is the left side).

However, the stream seems to be processed immediately, without waiting for the current version of the KTable to load.

The KTable's source topic holds a lot of values, and when I start the application, a lot of joins fail (strictly speaking they don't fail, since it is a left join, but they find no matching table value).

Can I delay the start so that it waits for the initial load of the table's topic?

Ben Yaakobi

2 Answers

12

Processing is time synchronized in Kafka Streams. Hence, the table input topic and the stream input topic are processed in record timestamp order. This is semantically sound, because in a stream-table join you want to join a stream record neither with an older nor with a newer version of the KTable, but with the right version based on the stream record's timestamp.
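
To make "the right version based on the stream record timestamp" concrete, here is a minimal plain-Java sketch. It is not Kafka Streams code: the class `VersionedTableDemo`, the method `versionAt`, and the sample data are hypothetical, and a regular KTable does not store its history like this. It only illustrates the lookup that timestamp-ordered processing is meant to achieve.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of this is Kafka Streams API.
class VersionedTableDemo {
    // Versions of ONE table key, ordered by record timestamp (made-up data).
    static final TreeMap<Long, String> VERSIONS = new TreeMap<>();
    static {
        VERSIONS.put(10L, "v1");
        VERSIONS.put(20L, "v2");
    }

    // Returns the table value a stream record with the given timestamp should
    // be joined with: the latest version at or before that timestamp, or null
    // if the key had no value yet (a left join would then see null).
    static String versionAt(long streamTimestamp) {
        Map.Entry<Long, String> entry = VERSIONS.floorEntry(streamTimestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        System.out.println(versionAt(5L));   // before the first table update
        System.out.println(versionAt(15L));  // between the two updates
        System.out.println(versionAt(25L));  // after the second update
    }
}
```

A stream record that is older than every table record finds no value, which matches the empty left-join results described in the question when the table data carries newer timestamps than the stream data.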

If your data is not properly timestamped, you can try to specify a custom timestamp extractor for the table via builder.table(..., Consumed.with(...)) to return timestamps that ensure proper behavior (i.e., maybe smaller than the timestamp of the first stream record?).

Note that proper timestamp synchronization requires Kafka Streams 2.1 or newer. Older versions synchronize time on a best-effort basis only and may not provide the behavior you want. For more details, see KIP-353.
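
The effect of such an extractor can be sketched with a small plain-Java simulation. Everything here is an assumption made for illustration: `ExtractorJoinDemo`, `Rec`, and the sample records are hypothetical, the tie-break in favor of the table is a choice of this sketch, and the loop only approximates timestamp-ordered processing. In Kafka Streams itself the hook is the `TimestampExtractor` interface, which is passed to the table through `Consumed`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

// Plain-Java simulation only; none of these types are Kafka Streams classes.
class ExtractorJoinDemo {
    static final class Rec {
        final String key;
        final String value;
        final long timestamp;
        Rec(String key, String value, long timestamp) {
            this.key = key;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Left-joins stream records against a table that is built up while both
    // inputs are consumed in order of their extracted timestamps.
    static List<String> leftJoin(List<Rec> stream, List<Rec> table,
                                 ToLongFunction<Rec> tableExtractor) {
        Deque<Rec> streamQueue = new ArrayDeque<>(stream);
        Deque<Rec> tableQueue = new ArrayDeque<>(table);
        Map<String, String> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        while (!streamQueue.isEmpty()) {
            Rec nextTable = tableQueue.peek();
            // Apply the table update first when its extracted timestamp is
            // not larger than the next stream timestamp (tie-break assumed).
            if (nextTable != null
                    && tableExtractor.applyAsLong(nextTable) <= streamQueue.peek().timestamp) {
                tableQueue.poll();
                state.put(nextTable.key, nextTable.value);
            } else {
                Rec r = streamQueue.poll();
                out.add(r.value + "+" + state.get(r.key)); // null = no match yet
            }
        }
        return out;
    }

    static final List<Rec> STREAM = Arrays.asList(new Rec("k", "s1", 5L), new Rec("k", "s2", 15L));
    static final List<Rec> TABLE = Arrays.asList(new Rec("k", "t1", 10L));

    // Table records keep their own timestamps.
    static List<String> withRecordTimestamps() {
        return leftJoin(STREAM, TABLE, r -> r.timestamp);
    }

    // Table records are all reported as timestamp 0, so they are applied first.
    static List<String> withZeroTimestamps() {
        return leftJoin(STREAM, TABLE, r -> 0L);
    }

    public static void main(String[] args) {
        System.out.println(withRecordTimestamps()); // [s1+null, s2+t1]
        System.out.println(withZeroTimestamps());   // [s1+t1, s2+t1]
    }
}
```

With the record timestamps, the first stream record is older than the table update and joins with nothing. With an extractor that reports a smaller timestamp for the table records, the table update is applied before any stream record is joined.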

Kafka 3.0 ships with more timestamp synchronization improvements: https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization

Matthias J. Sax
  • Actually, I do want to join with the newest version of the KTable.. Is there a way to do this? – Ben Yaakobi Jun 12 '19 at 07:51
  • I've upvoted your answer since it does explain why the KTable doesn't work but accepted the other answer because it is the solution to my requirement.. Thanks a lot! – Ben Yaakobi Jun 12 '19 at 09:59
  • Well. A `GlobalKTable` does behave differently, but it also provides different semantics and different disk requirements: it is not a sharded but a broadcast/replicated table, which increases the storage requirements on the client side. Thus, you should only use it for small data sets. It is also not time synchronized to a KStream, and thus a stream-table join has different semantics than a stream-globalTable join. Just want to make sure that you are aware of what using a GlobalKTable implies: it is not a "drop in" replacement for a KTable, because it changes the semantics of your program. – Matthias J. Sax Jun 12 '19 at 17:25
  • `Actually, I do want to join with the newest version of the KTable.. Is there a way to do this?` -- if you are using Kafka Streams 2.1 or newer, you can use a custom timestamp extractor for the KTable that always returns `0` as timestamp. This way, you get unsynchronized behavior and the KTable updates are applied immediately. --- Note that unsynchronized processing makes your application inherently non-deterministic though, and you cannot apply time-traveling to reproduce a previous result. – Matthias J. Sax Jun 12 '19 at 17:29
  • I tried what you said about returning 0 as the timestamp for the `KTable`. However, it still happens: the topology starts before the KTable has been fully loaded. I don't need a time-synchronized KTable. I need a compact cache that I can load values from. Currently the KTable seems to be the only thing giving me that, far from ideal as it is. – Ben Yaakobi Jun 13 '19 at 08:10
  • Interesting. We are aware of some bugs with regard to timestamp synchronization. Maybe you hit one of those issues. If your data set is small, using `GlobalKTable` seems to be a viable solution. – Matthias J. Sax Jun 13 '19 at 16:28
  • Tried your suggestion to use a zero timestamp and it works in the new version of kafka! Please add it to your answer as this was the solution – Ben Yaakobi Oct 31 '19 at 11:49
  • @MatthiasJ.Sax Thanks a lot for the valuable insights. I am facing the same case currently (Kafka 2.3.0). I am trying to read from topics A and B, and create a table from topic B (after some key transformation on the stream from B). I would like to make sure that B is read first before A, so that I don't "miss" joins between... So setting the timestamp to 0 (using a timestamp extractor) in B messages would work in theory? Or am I missing something there? – nsanglar Apr 20 '20 at 08:37
  • Using a custom `TimestampExtractor` and letting it return `0` for all records would basically mimic the behavior of a `GlobalKTable` that is bootstrapped at startup. This should work. – Matthias J. Sax Apr 20 '20 at 17:24
  • @Matthias J. Sax I have watched your talk about time and read, I think, enough material about event time vs. processing time, making a load deterministic, and so on. In this case, I would like to understand what you see as the problem if someone returns zero from a TimestampExtractor. I don't understand the "unsynchronized" comment. In my mind we're talking about a KTable-KTable join here, and caching one of the tables before emitting; what exactly becomes non-deterministic? I don't see how the result can change from one run to the other. The result of the join is always going to be the same, isn't it? – MaatDeamon Jul 01 '22 at 20:46
  • I mean, the table on the left of the join always keeps its timestamps; it is the one on the right that is set to zero so it can be consumed entirely before the left. I can't see how the output of that operation can differ from one run to the other. The only thing I can see is late arrivals, but that's a different issue. Hence, if you can expand on what you mean by unsynchronized behavior, that would be helpful. – MaatDeamon Jul 01 '22 at 20:51
  • It depends on whether you are only interested in the latest result or also want to replay older ones -- if you want to replay older ones, you need properly timestamped data. For the latest result, if you use timestamp zero for all records, it would be correct (as the join is eventually consistent). But the notion of "latest" result is a little fuzzy anyway in a streaming scenario, as it potentially always changes... -- The "issue" is more important for stream-table joins though; not for table-table. – Matthias J. Sax Jul 05 '22 at 16:21
2

You could use a GlobalKTable. It waits until all values have been synchronized.

todaynowork