
I need to understand when to use Kafka Connect vs. our own consumer/producer written by a developer. We are getting Confluent Platform. Also, to achieve a fault-tolerant design, do we have to run the consumer/producer code (jar file) on all the brokers?

asked by Anirban, edited by OneCricketeer

3 Answers


Kafka Connect is typically used to connect external systems to Kafka, i.e. to move data from external sources into Kafka and from Kafka out to external sinks.

Anything that you can do with a connector can also be done with the Producer and Consumer APIs.

Readily available connectors simply make it easier to connect external systems to Kafka without requiring the developer to write low-level client code.
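For example, the hand-rolled equivalent of a simple sink connector looks roughly like the sketch below (the broker address, topic name, and writeToExternalSystem helper are placeholders, not anything from the question):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class HandRolledSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "hand-rolled-sink");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // A sink connector would handle batching, conversion, retries, and offset tracking for you
                        writeToExternalSystem(record.key(), record.value());
                    }
                }
            }
        }

        // Stand-in for whatever external system you are writing to (database, search index, etc.)
        private static void writeToExternalSystem(String key, String value) {
            System.out.printf("would write %s=%s%n", key, value);
        }
    }

Everything a connector gives you on top of this loop (offset management, converters, parallel tasks, failure handling) is code you would otherwise write and operate yourself.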

Some points to remember:

  1. If the source and sink are both the same Kafka cluster, a connector doesn't make sense
  2. If you are doing change data capture (CDC) from a database and pushing the changes to Kafka, you can use a database source connector (see the registration sketch after this list)
  3. Resource constraints: Kafka Connect runs as a separate process, so double check what you can trade off between resources and ease of development
  4. If you are writing your own connector, that is well and good, unless someone has already written one. If you are using third-party connectors, you need to check how well they are maintained and/or whether support is available.
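For point 2, a database source connector is registered by posting its configuration to the Connect REST API rather than by writing client code. The sketch below assumes a Connect worker reachable at connect-host:8083 and the Confluent JDBC source connector plugin installed on it; the connection URL, table, and column names are made up:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterConnector {
        public static void main(String[] args) throws Exception {
            // Hypothetical JDBC source connector config; the exact keys depend on the plugin you install
            String config = """
                {
                  "name": "orders-cdc",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:postgresql://db-host:5432/shop",
                    "table.whitelist": "orders",
                    "mode": "incrementing",
                    "incrementing.column.name": "id",
                    "topic.prefix": "db-"
                  }
                }
                """;

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://connect-host:8083/connectors")) // default Connect REST port
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(config))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

After that, the Connect worker owns offset tracking and task scheduling for the connector; no producer/consumer code is involved.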
answered by JavaTechnical, edited by OneCricketeer
  • @cricket_007 How about adding *typically* to the statement? – JavaTechnical Dec 27 '19 at 05:48
  • There's valid uses for it. But not within the same cluster – OneCricketeer Dec 27 '19 at 05:54
  • 2
    For my use case, I want to take data from a topic, complete extensive transformation on the data (not basic) and save to a MongoDB. I know there is a connector for MongoDB but when to use this over a Consumer API is still unclear for me. Could more guidance be provided? I assume an attributing factor is what transformations on the data are required. Is there anything else to consider overall? – bmd Nov 20 '22 at 20:49

do we have to run the consumer/producer code (jar file) on all the brokers?

Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.

when to use Kafka Connect vs. own consumer/producer

In my experience, these factors should be taken into consideration:

  1. You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these workers don't run on the broker machines
  2. You don't plan on changing the connector code very often, because upgrading a connector means restarting the whole Connect JVM, which may be running other connectors that don't need to be restarted
  3. You aren't able to integrate your own producer/consumer code into your existing applications, or you would simply rather have a plain produce/consume loop
  4. Having structured data not tied to a particular binary format is preferred
  5. The connector you write yourself, or the community connector you use, is well tested and configurable for your use cases

Connect has limited options for fault tolerance compared to the raw producer/consumer APIs, which give you finer control at the cost of writing more code and depending on whatever other libraries you use
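To make that trade-off concrete, the sketch below shows roughly the kind of loop the raw consumer API asks you to own for at-least-once processing (the broker address, group id, topic, and process method are placeholders). As discussed in the comments below, running several copies of this, up to the number of topic partitions, gives failover comparable to Connect's distributed mode:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceWorker {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", "my-workers");              // all instances share this id for failover
            props.put("enable.auto.commit", "false");         // commit only after successful processing
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record); // your business logic, error handling, and retries go here
                    }
                    // If this instance crashes before the commit, another instance in the group
                    // is assigned its partitions and re-reads the uncommitted batch
                    consumer.commitSync();
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("processed offset %d: %s%n", record.offset(), record.value());
        }
    }
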

Note: Confluent Platform is still the same Apache Kafka

answered by OneCricketeer
  • Thanks @cricket_007 for the detailed reply. Will distributed mode for Kafka Connect not ensure a fault-tolerant design? Also, if we don't go with Kafka Connect, how should the consumer/producer be run to achieve fault tolerance? – Anirban Dec 27 '19 at 23:16
  • What kind of fault tolerance are you expecting? Running multiple consumer applications acts the same as Connect Distributed, but you can only run as many instances as topic partitions – OneCricketeer Dec 27 '19 at 23:27
  • Fault tolerant meaning not running a single instance but running multiple instances on different servers. What options do we have for producers? – Anirban Dec 28 '19 at 20:11
  • You can do the same thing using multiple servers running regular consumer applications, as mentioned. Producer scalability depends on your input source, but fault tolerance is limited to a single instance – OneCricketeer Dec 28 '19 at 21:20
  • Hi @cricket_007, my source is a log file. What are the options to speed up the process of pushing data to the Kafka broker? – Anirban Dec 31 '19 at 19:01
  • Fluentbit or Filebeat are popular options for that. Flume is outdated, but works too. Just don't use Kafka Connect file source (it's not meant for production use) – OneCricketeer Dec 31 '19 at 19:50
  • Hi @cricket_007, I am thinking of using the file stream source connector that comes with Confluent to ingest data from log files. Do you see any issue with that? We will get 2000 lines of 1K size every second. Also, could you explain your point #2 (You don't plan on changing the connector code very often, because you must restart the whole connector JVM, which would be running other connectors that don't need to be restarted)? The typical command I will be using to run the connector in distributed mode is connect-distributed myworker.properties. Are you talking about changing "connect-distributed"? – Anirban Jan 02 '20 at 21:10
  • @Anirban That's provided by Apache Kafka, not Confluent. And Confluent explicitly says not to use it for the purposes you're looking for.... Filebeat is used mainly with Elasticsearch, and if you are not using that, there's likely an equivalent solution for your log collection provider. On point 2, if you write a custom connector, you package it in a JAR. To upgrade the connector, you must copy that JAR into your Connect cluster via SSH for example. Then you must restart the connector process. Same process as any other Java application (update, stop old instances, and start new) – OneCricketeer Jan 03 '20 at 00:42
  • Thanks @cricket_007. I saw the documentation where it says that file stream source connector is for test/dev only. But I wanted to test it out. I am running it in distributed mode from the 2 servers on connect cluster. The jobs are running but not seeing the data I am appending to the file. What could be wrong ? Here are some of the properties from connect-distributed.properties file - – Anirban Jan 08 '20 at 19:40
  • I suggest you use the spool dir connector, which is more suited for that use case – OneCricketeer Jan 08 '20 at 19:44
  • name=filestream connector.class=FileStreamSourceConnector file=/home/ec2-user/test.txt tasks.max=1 topic=security_log listeners=http://0.0.0.0:8084 – Anirban Jan 08 '20 at 19:49
  • What about it? That would be standalone properties, not distributed... The connector doesn't accept appends, only existing file data is read into Kafka – OneCricketeer Jan 09 '20 at 00:16
  • Hi @Cricket_007, it will be really helpful if you could list the values to be set for distributed mode for file stream source. – Anirban Jan 09 '20 at 14:37
  • Sure...I will create a new post. I already spent a few days on this. I just want to see an end result before giving up on this (; – Anirban Jan 09 '20 at 15:53

Kafka Connect: Kafka Connect is an open-source framework with two kinds of connectors: source and sink. It is used to fetch data from external systems such as databases into Kafka, or to push data from Kafka out to them, which makes it easy to use various other systems with Kafka. It also helps in tracking changes from databases into Kafka (change data capture (CDC), as mentioned in one of the other answers). Connect maintains offsets so that it can resume reading or writing data from that particular position in Kafka or in the external system.

For more details, you can refer to https://docs.confluent.io/current/connect/index.html

The Producer/Consumer:
Producers and consumers are end systems that write to and read from Kafka topics directly. They are used where you want to distribute data to multiple consumers in a consumer group. Kafka also tracks the lag and offsets for each consumer group.
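For comparison, producing directly with the client API looks roughly like this minimal sketch (the broker address, topic, key, and value are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key-1", "hello"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                exception.printStackTrace(); // handle or retry failed sends here
                            } else {
                                System.out.printf("wrote to %s-%d@%d%n",
                                        metadata.topic(), metadata.partition(), metadata.offset());
                            }
                        });
            } // closing the producer flushes any buffered records
        }
    }
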

No, you don't need to run any producer/consumer while running Kafka Connect. If you want to check that there is no data loss, you can run a consumer while running source connectors. For sink connectors, the already-produced data can be verified in your database by running the relevant SELECT queries.