
I am considering setting up the retries mechanism to cover for network blips here and there; if the retries cover a few minutes, say 2-5, I think that will be enough for minor network issues. As per the answer to this question and the docs, the configs to be set are mainly retries, max.in.flight.requests.per.connection (recommended to be set to 1 by Kafka), retry.backoff.ms and delivery.timeout.ms.

My concern is that setting max.in.flight.requests.per.connection to 1 might have performance implications. Does anyone have experience with that? Also, what is the default number of connections a Kafka producer makes with a broker cluster? I couldn't find anything about it online.
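For concreteness, this is roughly the configuration I have in mind; the broker address and the exact values below are placeholders I'm experimenting with, not settings I'm confident about:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RetryBudgetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Give transient network blips a few minutes to resolve:
            // delivery.timeout.ms caps the total time spent on a record (send + retries),
            // while retry.backoff.ms spaces out the individual attempts.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000);       // ~5 minutes overall
            props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1_000);            // 1 s between attempts
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);         // bounded by the delivery timeout
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);  // the setting I'm unsure about

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // producer.send(...) calls would go here
            }
        }
    }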

Dev2017

1 Answer


max.in.flight.requests.per.connection

Indeed, this is one of the most important configuration parameters for a producer's performance, specifically its throughput and latency. This parameter controls the maximum number of unacknowledged requests the producer will send to a certain partition on a single connection before blocking.

In other words, it will send a single request, and until acks are received, it won't send another request to the broker (for that partition). As a suggestion, if you don't require all messages to be ordered, do not set this parameter to 1.

Regarding retries and its link to this param:

Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first.

So it's not really "recommended to be set to 1 by Kafka"; it is recommended when you require ordered delivery. If you don't need that, do not set max.in.flight.requests.per.connection to 1, as your producer's throughput will indeed decrease.

In summary: set it to 1 only if you are looking for ordered delivery of events.
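As a minimal sketch of that ordering-focused setup (the broker address, topic and keys below are placeholders; idempotence, covered later, is disabled here on purpose to illustrate the "old" approach):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class OrderedProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Only one unacknowledged request per connection: a retried batch can never
            // be overtaken by a later batch, so per-partition order is preserved,
            // at the cost of throughput.
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, false); // the "old", non-idempotent way
            props.put(ProducerConfig.ACKS_CONFIG, "1");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "key-1", "first"));
                // Buffered locally; the sender thread won't dispatch this batch
                // until the previous one has been acknowledged.
                producer.send(new ProducerRecord<>("orders", "key-1", "second"));
            }
        }
    }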

In this test, throughput and latency show a decent improvement when increasing max.in.flight.requests from 1 to just 2.

Throughput: [chart from the linked test]

Latency: [chart from the linked test]


acks

There is another param involved here, together with those you already quoted: acks, the number of acknowledgements required.

For example, acks=0 will make both the retries and max.in.flight params completely irrelevant, as the producer will not wait for any ack from any broker and will assume every request was successful, just like a UDP sender.

With acks=0:

1- retries does not take effect as there is no way to know if any failure occurred.

2- max.in.flight does not take effect, as there can be no unacknowledged requests whatsoever.

Setting acks higher than 0 has a direct impact on performance as well. Note that the producer-side acks setting only accepts 0, 1 and all; requiring, for example, 2 acknowledgements means setting acks=all together with min.insync.replicas=2 on the topic or broker side. In that case, for a request to be identified as successful, at least 2 in-sync replicas have to persist it. This means, for example, that the blocking time of a producer which allows only 1 in-flight request will usually increase, as it has to wait for those acknowledgements before unblocking and being able to send the next request for that partition.
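To make the acks trade-off concrete, here is a small sketch (the class and method names are only illustrative; min.insync.replicas is a topic/broker setting, not a producer one):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class AcksSketch {

        // Fire-and-forget: no acknowledgement is awaited, so retries and
        // max.in.flight are effectively irrelevant (lowest latency, weakest guarantee).
        static void fireAndForget(Properties props) {
            props.put(ProducerConfig.ACKS_CONFIG, "0");
        }

        // Leader-only acknowledgement: the partition leader must persist the batch.
        static void leaderAck(Properties props) {
            props.put(ProducerConfig.ACKS_CONFIG, "1");
        }

        // Strongest producer-side setting: all in-sync replicas must acknowledge.
        // With min.insync.replicas=2 configured on the topic/broker, at least two
        // replicas must persist the batch before the request succeeds, which
        // raises latency accordingly.
        static void allInSyncReplicasAck(Properties props) {
            props.put(ProducerConfig.ACKS_CONFIG, "all");
        }
    }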


Idempotence

There's another concept relevant to your question: the idempotent producer. This may be the optimal option to achieve a balance between performance and delivery guarantees.

Let's imagine you set some retries in order to guarantee a message arrives properly. The broker receives the message, but when it sends you the ack, a network error prevents your producer from receiving it. If retries are set, the producer will send the same message again, creating a duplicate message in the broker.

Kafka 0.11.0 includes support for idempotent and transactional capabilities in the producer. Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer.

An idempotent producer has a unique producer ID and uses sequence IDs for each message, which allows the broker to ensure it is committing ordered messages with no duplication, on a per partition basis.

In newer versions of the Kafka clients, the idempotent producer comes with 5 max.in.flight.requests by default, improving performance compared to the "old" way of ensuring delivery order. That is also the maximum for the idempotent producer (1 to 5 is the valid range of in-flight requests). It is, in short, the best option if you require an ordered, safe pipeline while keeping the producer's performance high.

The idempotent producer leads to the exactly-once semantics concept, explained in more depth in the link.
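A minimal sketch of such an idempotent producer (the broker address is a placeholder; on recent client versions, enable.idempotence=true already implies acks=all, effectively unlimited retries and at most 5 in-flight requests, so the explicit lines below mostly spell out those defaults):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class IdempotentProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // The broker de-duplicates per partition using the producer ID plus
            // sequence numbers, so retries cannot create duplicates and order is
            // preserved even with several requests in flight.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            props.put(ProducerConfig.ACKS_CONFIG, "all");                        // required for idempotence
            props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);  // 1-5 is the valid range here
            // Leave retries at its default (Integer.MAX_VALUE); lowering it can
            // surface OutOfOrderSequenceException, as discussed in the comments below.

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // producer.send(...) as usual
            }
        }
    }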


Design for max.in.flight > 1 with idempotence enabled

[diagram: producing with max.in.flight > 1 while idempotence is enabled]


To sum up, you should judge what your use case's requirements are, with questions like:

Is ordered delivery required?

Are duplicate messages acceptable?

Do you value throughput and latency to the point where some lost or unordered messages are acceptable?

Could the idempotent producer be the answer to your requirements, achieving a balance between performance and ordering/delivery guarantees?


This presentation summarizes more or less the impact of these configurations on the producer side; it's worth having a look.

aran
• According to this question: https://stackoverflow.com/questions/55192852/transactional-producer-vs-just-idempotent-producer-java-exception-outoforderseq idempotence has issues with retries and it is recommended that retries are left unset if idempotence is to be used. How will idempotence cover for retries? – Dev2017 Feb 08 '21 at 07:49
• The idempotent producer defaults to Integer.MAX_VALUE as its retries parameter, which means it will make sure the message arrives. This is the value recommended not to be altered, as it ensures that transient errors are retried indefinitely (well, almost...): Integer.MAX_VALUE is 2147483647 retries on the same message. – aran Feb 08 '21 at 07:57
• The suggestion is: if you set up an idempotent producer, do not change the default retries. – aran Feb 08 '21 at 07:59
  • There is the OutOfOrderSequenceException to take care of, how should that be handled? – Dev2017 Feb 08 '21 at 07:59
  • Is OutOfOrderSequenceException a transient error? – Dev2017 Feb 08 '21 at 08:01
• That's because the OP of that question altered the retries; so one of the messages wasn't delivered correctly, hence the broker received an unexpected sequence number from the producer, which means data was lost. That's why the retries should not be altered. It means that after, for example, just 5 retries the message is discarded and the broker receives message nº10 but not the failed nº9. – aran Feb 08 '21 at 08:01
• No, a transient error is what the idempotent producer protects you against; for example, a network error lasting 2 minutes is a transient error. The idempotent producer will retry during that time, making sure the message arrives properly once the error disappears. That's not guaranteed with any other producer that has, for example, just 3 retries set: it will lose data during the transient failure. – aran Feb 08 '21 at 08:03
• https://www.linkedin.com/learning/learn-apache-kafka-for-beginners/producer-part-3-safe-producer --- a short read; it's really well explained there. – aran Feb 08 '21 at 08:07
  • hmm. What happens if there is a network issue and the producer is retrying for message A, the network issue resolves and we are waiting for 100ms between 1 retry and another and meanwhile message B is attempted? This will throw an OutOfOrderSequenceException which is not transient and won't be retried. Does that mean that only message A will be sent and no subsequent messages (C, D etc.) will be sent since they will be all out of sequence as message B wasn't sent? – Dev2017 Feb 09 '21 at 06:39
• You are right that it will be an OutOfOrderSequenceException, but it will indeed be transient and retriable. The following in-flight messages (C, D...) will be discarded until A is correctly received. Take a look here, points 5 and 6: https://docs.google.com/document/d/1EBt5rDfsvpK6mAPOOWjxa9vY0hJ0s9Jx9Wpwciy0aVo/edit – aran Feb 09 '21 at 22:39
• "If a produce request fails, succeeding in flight batches will also fail with an OutOfOrderSequenceException. As such, if the sequence number of a batch is not the successor of the last ack’d sequence, and if it fails with an OutOfOrderSequenceException, we consider this to be retriable." --- As retries are greater than 0, the producer will "reset" to the failure point and be able to produce B, C, D in order. – aran Feb 09 '21 at 22:40
  • Updated the answer with the proposal for that use case – aran Feb 09 '21 at 22:54