
With Spark Streaming, I can read Kafka messages and write data to different kinds of tables, for example HBase, Hive, and Kudu. But this can also be done using Kafka connectors for these tables. My question is: in which situations should I prefer connectors over the Spark Streaming solution?

Also, how fault tolerant is the Kafka connector solution? We know that with Spark Streaming we can use checkpoints and executors running on multiple nodes for fault-tolerant execution, but how is fault tolerance (if possible) achieved with Kafka connectors? By running the connector on multiple nodes?

Robin Moffatt
MetallicPriest

2 Answers


So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them to other services.

Kafka Connect is probably easier for standard tasks, since it offers various connectors out of the box, so it will quite probably reduce the need to write any code. If you just want to copy a bunch of records from Kafka to HDFS or Hive, it will probably be easier and faster to do with Kafka Connect.
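For example, the Kafka-to-HDFS case reduces to a configuration file submitted to the Connect REST API. A sketch using the Confluent HDFS sink connector; the topic, URL, and metastore values are placeholders:

```json
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "my-topic",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://metastore:9083"
  }
}
```

No code to write, compile, or deploy — the connector jar does the work.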

Having this in mind, Spark Streaming clearly takes over when you need to do things that are not standard, i.e. if you want to perform aggregations or calculations over records and write them to Hive, then you should probably go for Spark Streaming from the beginning.
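As a sketch of that kind of job (PySpark here; it assumes a running Kafka broker, and the server/topic names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = SparkSession.builder.appName("kafka-aggregation").getOrCreate()

# Read the stream from Kafka (bootstrap server and topic are placeholders)
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load())

# Per-key aggregation -- the kind of stateful processing a plain sink
# connector cannot do on its own
counts = events.groupBy(col("key")).agg(count("*").alias("n"))

# Write the running aggregate out (console sink here; a Hive sink is analogous)
query = (counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())
query.awaitTermination()
```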

Generally, I have found doing non-standard things with Kafka Connect, like splitting one message into multiple ones (assuming it was, for example, a JSON array), to be quite troublesome; it often requires much more work than it would in Spark.
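To illustrate the gap: the split itself is trivial in plain code (a toy sketch; the function name is mine, not any API):

```python
import json

def split_array_message(value: bytes) -> list[bytes]:
    """Split one Kafka message whose value is a JSON array into one message per element."""
    records = json.loads(value)
    if not isinstance(records, list):
        return [value]  # non-array messages pass through unchanged
    return [json.dumps(r).encode("utf-8") for r in records]

# With Spark this is a one-line flatMap over the stream; with Kafka Connect it
# means writing, packaging, and deploying a custom Transformation class.
out = split_array_message(b'[{"id": 1}, {"id": 2}]')
```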

As for Kafka Connect fault tolerance: as described in the docs, it is achieved by running multiple distributed workers with the same group.id; the workers redistribute tasks and connectors if one of them fails.
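The relevant worker settings look like this (a minimal sketch; the topic names here are common conventions, not requirements):

```properties
# connect-distributed.properties -- workers sharing this group.id form one cluster
bootstrap.servers=localhost:9092
group.id=connect-cluster
# Connect stores its own state in Kafka topics, so a restarted or replacement
# worker can recover configs, offsets, and status
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
```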

Dominik Wosiński
  • Do these connectors (at least Kafka to HDFS) come with the standard Kafka installation, or do you need to install them separately? – MetallicPriest Jun 04 '20 at 12:56
  • I don't think they are part of the vanilla Kafka binary :) – Dominik Wosiński Jun 04 '20 at 12:57
  • AFAIK Kafka Connect is just a library to connect and fetch/push data; setup such as topics you have to do separately, it is not part of the library. I'd suggest you go through the respective documentation. – Ram Ghadiyaram Jun 04 '20 at 16:42
  • Kafka connectors are jar files that pull data from and push data to Kafka. This is most important when you have to pull a huge amount of data from a topic. Assume a scenario where you perform some aggregation on live data pushed to a topic, and you want the result saved while reducing the code for the process: Kafka connectors are the most useful. Any time ksqlDB creates a new stream aggregate or table, its data is a topic; to push that aggregate onward — connectors again. – Harshith Yadav Dec 06 '20 at 05:50

in which situations should I prefer connectors over the Spark Streaming solution?

"It Depends" :-)

  1. Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
  2. If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
  3. If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
  4. As a framework, Kafka Connect is more transferable since the concepts are the same; you just plug in the appropriate connector for the technology that you want to integrate with each time
  5. Kafka Connect handles all the tricky stuff for you like schemas, offsets, restarts, scaleout, etc etc etc
  6. Kafka Connect supports Single Message Transform for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc etc). For more advanced processing you would use something like Kafka Streams or ksqlDB.
  7. If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)
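As an illustration of point 6, an SMT is just extra configuration alongside the connector (a sketch using the built-in MaskField transform; the field names are placeholders):

```json
{
  "transforms": "mask",
  "transforms.mask.type": "org.apache.kafka.connect.transforms.MaskField$Value",
  "transforms.mask.fields": "ssn,credit_card"
}
```

The named fields are nulled out in every record as it passes through the pipeline, with no code written.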

Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possible) achieved with Kafka connectors?

  1. Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
  2. Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)

If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.


Disclaimer: I work for Confluent and am a big fan of Kafka Connect :-)

Robin Moffatt