
I am new to Kafka and data ingestion. I know Kafka is fault tolerant, as it keeps the data redundantly on multiple nodes. However, what I don't understand is how we can achieve fault tolerance on the source/producer end. For example, if I have netcat as the source, as in the example below:

nc -l [some_port] | ./bin/kafka-console-producer --broker-list [kafka_server]:9092 --topic [my_topic]

The producer would fail to push messages if the node running netcat goes down. I was wondering whether there is a mechanism by which Kafka can pull the input itself, so that, for example, if netcat fails on one node, another node can take over and start pushing messages using netcat.

My second question is how this is achieved in Flume, as it has a pull-based architecture. Would Flume work in this case, that is, if one node running netcat fails?

MetallicPriest

1 Answer


Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental id, known as an offset, as shown below.

Partition 0:

+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+

Partition 1:

+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
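As a toy sketch of the idea above (assumptions: this is not Kafka's real partitioner, which uses murmur2 hashing; the byte-sum hash, keys, and values are made up for illustration), keyed messages map to a partition and receive incremental offsets within it:

```python
# Toy model: 2 partitions, each an append-only log of (offset, value) pairs.
NUM_PARTITIONS = 2
partitions = [[] for _ in range(NUM_PARTITIONS)]

def send(key, value):
    """Append a message to the partition chosen by hashing its key."""
    p = sum(key.encode()) % NUM_PARTITIONS   # stand-in for Kafka's murmur2 hash
    offset = len(partitions[p])              # next incremental offset in that partition
    partitions[p].append((offset, value))
    return p, offset

send("user-a", "click")
send("user-a", "scroll")   # same key -> same partition, next offset
send("user-b", "click")
```

Because messages with the same key always land in the same partition, their relative order is preserved within that partition.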

Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.

Example of 2 topics (with 3 and 2 partitions, respectively):

Broker 1:

+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|     Topic 2       |
|   Partition 1     |
+-------------------+

Broker 2:

+-------------------+
|      Topic 1      |
|    Partition 2    |
|                   |
|                   |
|     Topic 2       |
|   Partition 0     |
+-------------------+

Broker 3:

+-------------------+
|      Topic 1      |
|    Partition 1    |
|                   |
|                   |
|                   |
|                   |
+-------------------+

Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).

Topics should have a replication-factor greater than 1 (usually 2 or 3), so that when a broker is down, another one can serve the data of a topic. For instance, assume a topic with 2 partitions and a replication-factor of 2, as shown below:

Broker 1:

+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|                   |
|                   |
+-------------------+

Broker 2:

+-------------------+
|      Topic 1      |
|    Partition 0    |
|                   |
|                   |
|     Topic 1       |
|   Partition 1     |
+-------------------+

Broker 3:

+-------------------+
|      Topic 1      |
|    Partition 1    |
|                   |
|                   |
|                   |
|                   |
+-------------------+

Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. A replication-factor of 3 is always a good idea, since it allows one broker to be taken down for maintenance while another one fails unexpectedly. This is why Apache Kafka offers strong durability and fault-tolerance guarantees.
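As a toy model of why this works (assumptions: the round-robin placement and broker ids here are illustrative, not Kafka's real assignment algorithm):

```python
# Toy sketch: place 3 replicas of each partition on consecutive brokers,
# then check what survives when one broker fails.
BROKERS = [1, 2, 3]
REPLICATION_FACTOR = 3

def assign_replicas(partition, brokers=BROKERS, rf=REPLICATION_FACTOR):
    """Round-robin placement: rf copies on consecutive brokers."""
    n = len(brokers)
    return [brokers[(partition + i) % n] for i in range(rf)]

# A topic with 2 partitions, replication-factor 3.
replicas = {p: assign_replicas(p) for p in range(2)}

# If broker 2 fails, every partition still has 2 surviving replicas.
failed = 2
survivors = {p: [b for b in rs if b != failed] for p, rs in replicas.items()}
```

With replication-factor 3, losing any single broker still leaves two copies of every partition available.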

Note about Leaders: At any time, only one broker can be a leader of a partition and only that leader can receive and serve data for that partition. The remaining brokers will just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when a broker fails. In general, when all replicas of a partition fail or go offline, the leader will automatically be set to -1.
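The leader-selection rule described above can be sketched as follows (a toy model, not Kafka's actual controller logic):

```python
# Toy sketch: the leader must be a live broker from the in-sync replica
# set (ISR); if no in-sync replica is left alive, the leader becomes -1.
def elect_leader(isr, live_brokers):
    """Pick the first in-sync replica that is still alive, else -1."""
    for broker in isr:
        if broker in live_brokers:
            return broker
    return -1

isr = [2, 1, 3]                # replicas currently in sync, preferred order
elect_leader(isr, {1, 2, 3})   # broker 2 leads while it is alive
elect_leader(isr, {1, 3})      # broker 2 fails -> leadership moves to another ISR member
elect_leader(isr, set())       # all replicas offline -> leader is -1
```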


Having said that, as long as your producer lists all the addresses of the Kafka brokers in the cluster (bootstrap_servers), you should be fine. Even when a broker is down, your producer will attempt to write the record to another broker.

Finally, make sure to set acks=all (it might have an impact on throughput, though) so that all in-sync replicas acknowledge that they received the message.
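For instance, a minimal producer configuration along these lines (the hostnames are placeholders) could look like:

```properties
# producer.properties
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
acks=all
retries=3
```

The console producer can load such a file via its --producer.config flag.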

Giorgos Myrianthous
  • Well, I understand what you described here, but my question was more about the source. Here we are running netcat on one node. What happens if that node fails? Can another node start doing netcat to feed messages to Kafka? Does Kafka provide anything for this kind of fault tolerance, or do we need to implement it ourselves at the source? If so, how? Is Flume helpful there, as it has a pull-based architecture? – MetallicPriest May 21 '20 at 11:22
  • @MetallicPriest That's exactly my point here. As long as you have at least 3 brokers and replication factor=3, you shouldn't worry if you lose a broker for some time. – Giorgos Myrianthous May 21 '20 at 11:23
  • Well, your point is about pushing the output of netcat to the Kafka producer. That is clear. But my question was more about the source itself, the one doing netcat. If the node doing netcat fails, we won't have any messages being given to the Kafka producer. So, basically, how can we achieve fault tolerance at the source? Can we, for example, have the Kafka producers running the netcat command themselves and pushing that output to their topics/message queues? – MetallicPriest May 21 '20 at 11:32
  • First of all, can you clarify what you mean by netcat? You mentioned it many times and I am not sure I understand the context. – Giorgos Myrianthous May 21 '20 at 11:34
  • Well, netcat is a tool, which in this example is used to read streaming data from a web server. So basically the Kafka producers are being fed lines using netcat. But you can replace netcat with any source, and the question remains the same. In the netcat example, for instance, could Kafka run that command itself? – MetallicPriest May 21 '20 at 11:35
  • Why are you running this `nc -l [some_port]` ? – Giorgos Myrianthous May 21 '20 at 17:35
  • This would get a line from a server, which we have established connection with on a TCP port. Here we are redirecting that output (the line read from the server) to a Kafka producer. – MetallicPriest May 21 '20 at 21:30