I would like to create Tables in ksqlDB from Debezium source topics, with the ultimate aim of performing a left join on these tables and efficiently outputting materialized views to a downstream database using the JDBC sink connector.
The Debezium source topics have not had any transforms applied (such as ExtractNewRecordState), so contain a 'before' and 'after' property, as described in the Debezium documentation here.
The reason for not applying the ExtractNewRecordState transform (which would presumably simplify matters) is that the source CDC topics may be used for various purposes and it does not appear possible to create multiple topics off the same source database table (since topic names are automatically determined by Debezium and depend on the database server, schema and table name as described here).
The best approach I have found so far is to:
- create a stream in ksqlDB from the raw Debezium input, e.g.:
CREATE STREAM user_stream WITH (KAFKA_TOPIC='mssql.dbo.user', VALUE_FORMAT='AVRO');
- create a second stream selecting the required fields from the 'after' property of the first stream, e.g.:
CREATE STREAM user_stream2 AS SELECT AFTER->user_id, AFTER->username, AFTER->email FROM user_stream EMIT CHANGES;
- finally, convert the second stream to a table as described here, namely:
SELECT user_id,
LATEST_BY_OFFSET(username) AS username,
LATEST_BY_OFFSET(email) AS email
FROM user_stream2
GROUP BY user_id
EMIT CHANGES;
These steps must be repeated to generate each Table, at which point a join can be performed on the Tables to produce an output.
This seems quite long-winded, with a lot of intermediate steps. Performance also seems sluggish. Is there a better and/or more direct way to generated materialized views using ksqlDB and Debezium? Can any of the steps be cut out and/or should I be using a different approach in step 3 (such as a windowing function)?
I'm particularly keen to ensure that the approach taken is the most efficient from a performance and resource usage perspective.