
I am trying to understand what Connect buys you that Streams does not. We have a part of our application where we want to consume a topic and write to MariaDB.

I could accomplish this with a simple processor: read the record, store it in a state store, and then bulk insert into MariaDB.
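Roughly, something like this minimal sketch (the topic, table, and connection details are made up, and a real version would buffer in a state store rather than an in-memory list):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

// Buffer incoming records and periodically bulk insert them into MariaDB.
public class MariaDbSinkProcessor implements Processor<String, String> {

    // Simplification: a real version might use a state store instead of this list.
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void init(ProcessorContext context) {
        // Flush the buffer on a wall-clock schedule (every 30 seconds here).
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME,
                timestamp -> flush());
    }

    @Override
    public void process(String key, String value) {
        buffer.add(value);
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mariadb://localhost:3306/mydb", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO my_table (payload) VALUES (?)")) {
            for (String value : buffer) {
                stmt.setString(1, value);
                stmt.addBatch();
            }
            stmt.executeBatch();
            buffer.clear();
        } catch (Exception e) {
            throw new RuntimeException("Bulk insert failed", e);
        }
    }

    @Override
    public void close() {
        flush();
    }
}
```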

Why is this a bad idea? What does the JDBC Sink Connector buy you?


1 Answer


Great question! It's all about using the right tool for the job. Kafka Connect's specific purpose is streaming integration between source systems and Kafka, or from Kafka down to other systems (including RDBMS).

What does Kafka Connect give you?

  • Scalability; you can deploy multiple workers and Kafka Connect will distribute tasks across them
  • Resilience; if a node fails, Kafka Connect will restart the work on another worker
  • Ease of use; connectors exist for numerous technologies, so implementing an integration usually means just a few lines of JSON configuration (see the example after this list)
  • Schema management; support for schemas in JSON, full integration with the Schema Registry for Avro, pluggable converters from the community for Protobuf
  • Inline transformations with Single Message Transforms (SMTs)
  • Unified and centralised management and configuration for all your integration tasks
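As a rough illustration of that "few lines of JSON" point, a JDBC Sink Connector instance is little more than a configuration like the one below (the connector class and property names are Confluent's JDBC sink connector; the topic, table, and connection details are placeholders and depend on your setup):

```json
{
  "name": "mariadb-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "connection.url": "jdbc:mariadb://mariadb:3306/mydb",
    "connection.user": "kafka",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true"
  }
}
```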

That's not to say that you can't do this in Kafka Streams, but you would end up having to code a lot of this yourself, when it's provided out of the box for you by Kafka Connect. Just as you could use the Consumer API and a bunch of bespoke code to do the stream processing that the Kafka Streams API gives you, you could use Kafka Streams to get data from a Kafka topic into a database. But why would you?

If you need to transform data before it's sent to a sink then a recommended pattern is to decouple the transformation from the sending. Transform the data in Kafka Streams (or KSQL) and write it back to another Kafka topic. Use Kafka Connect to listen to that new topic and write the transformed messages to the target sink.
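A minimal sketch of that pattern in Kafka Streams might look like the following (topic names, serdes, and the transformation itself are placeholders; the JDBC Sink Connector would then be pointed at the output topic):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class TransformApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Read from the source topic, apply the transformation, and write to a
        // new topic that the JDBC Sink Connector consumes.
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(TransformApp::transform)
               .to("orders-transformed", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String transform(String value) {
        // Placeholder for whatever reshaping the sink needs.
        return value;
    }
}
```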

Robin Moffatt
  • Just want to add to the excellent answer: Kafka Streams is not designed to talk to external systems. This has implications for processing guarantees; in particular, exactly-once processing breaks if you connect to external systems. It's also a question of decoupling: if your external system goes down, Kafka Streams would most likely crash, whereas Kafka Connect can handle this case for you seamlessly. – Matthias J. Sax Jan 18 '19 at 17:49
  • @Robin Moffatt, on your comment about transforming before sending to the sink: does Connect expect the data in a particular format? For example, if the records are JSON, should the fields match the column names of the table? – Chris Jan 18 '19 at 21:21
  • @Chris Connect uses an internal `Struct` class. If you have plain JSON without `schema` and `payload` fields, then it is considered "schemaless", and there are limited operations you can apply to those records. That might work well when storing records in Mongo or Elasticsearch, but not so well when writing into an RDBMS – OneCricketeer Jan 20 '19 at 00:00
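To illustrate the point in the last comment, a JSON record that Connect's `JsonConverter` (with `schemas.enable=true`) can map onto a `Struct`, and hence onto table columns, carries an explicit schema/payload envelope roughly like this (the field names here are just an example):

```json
{
  "schema": {
    "type": "struct",
    "name": "order",
    "optional": false,
    "fields": [
      { "field": "id",   "type": "int64",  "optional": false },
      { "field": "name", "type": "string", "optional": true }
    ]
  },
  "payload": {
    "id": 42,
    "name": "widget"
  }
}
```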