I'm currently looking into Kafka Connect to stream some of our databases to a data lake. To test out Kafka Connect I've set up a database server with one of our project databases on it. So far so good.

Next, I configured Kafka Connect with the following properties:

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "timestamp.column.name": "updated_at,created_at",
  "incrementing.column.name": "id",
  "dialect.name": "SqlServerDatabaseDialect",
  "validate.non.null": "false",
  "tasks.max": "1",
  "mode": "timestamp+incrementing",
  "topic.prefix": "mssql-jdbc-",
  "poll.interval.ms": "10000",
}

While this works for the majority of my tables, where I have an ID and a created_at/updated_at field, it won't work for the tables where I resolved my many-to-many relationships with a junction table and a composite key. Note that I'm using the generic JDBC configuration with Microsoft's JDBC driver.

Is there a way to configure Kafka Connect for these special cases?

Vince V.
  • Have you tried using the `query` configuration options to write a join, for example? Otherwise, the documented Kafka pattern is using Kafka Streams to join your "lookup table" topics with the "data table" topics – OneCricketeer Sep 09 '18 at 15:23
  • @cricket_007 I don't quite understand. I'm under the impression that the query parameter can be used when you want to track specific queries. In my case I want to track the complete database. It could be that I'm wrong though. :-) – Vince V. Sep 10 '18 at 06:51
  • The incrementing timestamp options are also "specific queries", no? I'm not sure I understand your concern – OneCricketeer Sep 10 '18 at 13:04

1 Answer

Instead of one connector to pull all of your tables, you may need to create multiple ones. This would be the case if you want to use different methods for fetching the data, or different ID/timestamp columns. As @cricket_007 says, you can use the query option to pull back the results of a query—which could be a SELECT expressing your multi-table join. Even when pulling data from a single table object, the JDBC connector itself is just issuing a SELECT * from the given table, with a WHERE predicate to restrict the rows selected based on the incrementing ID/timestamp.
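
For example, a second connector dedicated to a junction table could look something like the following. This is a sketch only: the table and column names (order_product, products, updated_at) are invented placeholders, and when query is set the connector uses topic.prefix as the complete topic name rather than as a prefix. Depending on the connector version you may also need to wrap the query in a subquery so that the WHERE clause the connector appends for the timestamp column is unambiguous.

{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "mode": "timestamp",
  "timestamp.column.name": "updated_at",
  "query": "SELECT op.order_id, op.product_id, op.updated_at, p.name AS product_name FROM order_product op JOIN products p ON p.id = op.product_id",
  "topic.prefix": "mssql-jdbc-order-product",
  "poll.interval.ms": "10000",
  "tasks.max": "1"
}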

The alternative is to use log-based change data capture (CDC), and stream all changes directly from the database into Kafka.
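
With Debezium's SQL Server connector, for instance, a minimal configuration might look roughly like this. Every hostname, credential, and table name below is a placeholder, CDC has to be enabled on the database and on each captured table beforehand, and exact property names differ between Debezium versions:

{
  "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
  "database.hostname": "your-sqlserver-host",
  "database.port": "1433",
  "database.user": "connect_user",
  "database.password": "connect_password",
  "database.dbname": "your_project_db",
  "database.server.name": "mssql",
  "table.include.list": "dbo.orders,dbo.products,dbo.order_product",
  "database.history.kafka.bootstrap.servers": "kafka:9092",
  "database.history.kafka.topic": "dbhistory.mssql"
}

Because log-based CDC reads the transaction log rather than polling with a WHERE clause, a junction table with a composite primary key is captured like any other table, and all of the key columns end up in the Kafka message key.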

Whether you use JDBC or log-based CDC, you can use stream processing to resolve joins in Kafka itself. An example of this is Kafka Streams or KSQL. I've written about the latter a lot here.
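
As a rough sketch of what that join could look like in KSQL, assuming the topic names and schemas from the placeholder example above (exact syntax varies between KSQL/ksqlDB versions):

-- Placeholder topic names and schemas; adjust to your own
CREATE TABLE products_lookup (id INT PRIMARY KEY, name VARCHAR)
  WITH (KAFKA_TOPIC='mssql-jdbc-products', VALUE_FORMAT='AVRO');

CREATE STREAM order_product (order_id INT, product_id INT)
  WITH (KAFKA_TOPIC='mssql-jdbc-order_product', VALUE_FORMAT='AVRO');

-- Stream-table join on the table's primary key; results land in a new topic
CREATE STREAM order_product_enriched AS
  SELECT op.order_id, op.product_id, p.name AS product_name
  FROM order_product op
  JOIN products_lookup p ON op.product_id = p.id;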

You might also find this article useful; it describes in detail your options for integrating databases with Kafka.

Disclaimer: I work for Confluent, the company behind the open-source KSQL project.

Robin Moffatt
  • In our case, the auto-increment identity column is named _id. If I am reading your first statement correctly, does that mean we need a different connector for each table, since the incrementing column name is different for each table? – Suhas Jan 06 '21 at 17:58
  • @suhas Yes. Or use log-based CDC e.g. https://debezium.io/documentation/reference/connectors/sqlserver.html – Robin Moffatt Jan 07 '21 at 08:35
  • One of the databases is MariaDB in a Galera cluster, which uses the InnoDB engine instead of the MySQL engine. In any case, Debezium does not support MariaDB, so I am exploring JDBC options – Suhas Jan 07 '21 at 13:36