How can I determine when KSQL has fully loaded my data from a Kafka topic into my table?

GOAL: Take 2 Kafka topics, join them and write the results to a new Kafka topic.

EXAMPLE:

I am using KSQL's REST API to issue the following commands.

CREATE TABLE MyTable (A1 VARCHAR, A2 VARCHAR) WITH (kafka_topic='topicA', key='A1', value_format='json');
CREATE STREAM MyStream (B1 varchar, B2 varchar) WITH (kafka_topic='topicB', value_format='json');
CREATE STREAM MyDestination WITH (kafka_topic='topicC', partitions=1, value_format='json') AS
  SELECT a.A1 AS A1, a.A2 AS A2, b.B1 AS B1, b.B2 AS B2
  FROM MyStream b
  LEFT JOIN MyTable a ON a.A1 = b.B1;

PROBLEM: topicC only has data from topicB, and all joined values are null.

Although the CREATE TABLE command returns a status of SUCCESS, it appears that the data has not fully loaded into the table at that point. Consequently, the result of the third command only has data from the stream and does not include data from the table. If I artificially delay before executing the join command, the resulting topic correctly has data from both topics. How can I determine when my table is loaded and it is safe to execute the join command?

Matthias J. Sax
Hoppy

2 Answers

This is a great question. At this point, KSQL doesn't have a way to automatically execute a stream-table join only once the table is fully loaded. That would indeed be a useful feature. A more general, related problem is discussed here: https://github.com/confluentinc/ksql/issues/1751

apurva

Tables in KSQL (and in the underlying Kafka Streams) have a time dimension, i.e., they evolve over time. For a stream-table join, each stream record is joined with the "correct" table version (i.e., tables are versioned by time).
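
To make the idea of a time-versioned table concrete, here is a minimal toy sketch in Python. It is not how KSQL or Kafka Streams implement tables; the class `VersionedTable` and its methods `put` and `get_as_of` are hypothetical names used only for illustration. The sketch keeps every version of a key and looks up the version that was valid at a given timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


class VersionedTable:
    """Toy model (an assumption, not KSQL's implementation):
    keeps every version of a key, ordered by timestamp."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # key -> sorted timestamps
        self._values = defaultdict(list)      # key -> values aligned with timestamps

    def put(self, key, value, ts):
        # Insert the new version at the position that keeps timestamps sorted.
        idx = bisect_right(self._timestamps[key], ts)
        self._timestamps[key].insert(idx, ts)
        self._values[key].insert(idx, value)

    def get_as_of(self, key, ts):
        # Return the latest version with timestamp <= ts, or None if there is none.
        idx = bisect_right(self._timestamps.get(key, []), ts)
        return self._values[key][idx - 1] if idx else None


table = VersionedTable()
table.put("k1", "v1", ts=10)
table.put("k1", "v2", ts=20)

# A stream record with timestamp 15 joins against the version written at 10.
joined_early = table.get_as_of("k1", 5)    # no version exists yet
joined_mid = table.get_as_of("k1", 15)     # version written at ts=10
joined_late = table.get_as_of("k1", 25)    # version written at ts=20
```

In this model, a stream record whose timestamp is older than every table record for its key finds no match, which corresponds to the null values a left join produces.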

In the upcoming CP 5.1 release, you can "pre-load" the table by ensuring that all record timestamps of the table topic are smaller than the record timestamps of the stream topic. This tells KSQL that it needs to process the table topic data first, advancing the table's timestamp version accordingly, before it can start joining.
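
The effect of timestamp ordering can be sketched with a small Python simulation. This is a toy model under stated assumptions, not KSQL's actual processing logic: the hypothetical function `left_join_by_timestamp` always processes the record with the smallest timestamp next across both inputs, and on equal timestamps it processes the table record first.

```python
import heapq


def left_join_by_timestamp(table_records, stream_records):
    """Toy stream-table left join (an assumption, not KSQL's implementation).

    Both inputs are lists of (timestamp, key, value) sorted by timestamp.
    Records are processed in global timestamp order; table records win ties.
    """
    # Tag each record with its source: 0 = table, 1 = stream.
    tagged_table = [(ts, 0, key, value) for ts, key, value in table_records]
    tagged_stream = [(ts, 1, key, value) for ts, key, value in stream_records]

    table_state = {}
    output = []
    for ts, source, key, value in heapq.merge(tagged_table, tagged_stream):
        if source == 0:
            # Table record: update the current version for this key.
            table_state[key] = value
        else:
            # Stream record: join with whatever the table holds right now.
            output.append((key, value, table_state.get(key)))
    return output


stream = [(100, "k1", "b1"), (101, "k2", "b2")]

# Table timestamps smaller than all stream timestamps: the table is "pre-loaded".
preloaded = [(1, "k1", "a1"), (2, "k2", "a2")]
joined_preloaded = left_join_by_timestamp(preloaded, stream)

# Table timestamps larger than the stream timestamps: the join sees an empty table.
late = [(200, "k1", "a1"), (201, "k2", "a2")]
joined_late = left_join_by_timestamp(late, stream)
```

With the pre-loaded timestamps, every stream record finds its table row; with the later timestamps, the table side of each joined record is `None`, mirroring the null values described in the question.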

For more details, check out: https://www.confluent.io/resources/streams-tables-two-sides-same-coin

Matthias J. Sax