
We are running Kafka in distributed mode across 2 servers. I'm sending messages through the Java SDK to a topic that has a replication factor of 2 and 1 partition.

The producer runs in async mode. I don't see anything abnormal in the Kafka logs. Can anyone help find out what the cause could be?

    Properties props = new Properties();
    props.put("bootstrap.servers", serverAdress);
    props.put("acks", "all");
    props.put("retries", "1");
    props.put("linger.ms", 0);
    props.put("buffer.memory", 10240000);
    props.put("max.request.size", 1024000);
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    Producer<String, Object> producer = new org.apache.kafka.clients.producer.KafkaProducer<>(props);

Exception trace:

-2017-08-15T02:36:29,148 [kafka-producer-network-thread | producer-1] WARN producer.internals.Sender - Got error produce response with correlation id 353736 on topic-partition BPA_BinLogQ-0, retrying (0 attempts left). Error: NETWORK_EXCEPTION

Anil Kumar

2 Answers


You are getting a NETWORK_EXCEPTION, which tells you that something is wrong with the network connection to the Kafka broker you were producing to. Either the broker shut down or the TCP connection was closed for some reason.

Hans Jespersen
  • Is there a way to get the specific reason/trace? NETWORK_EXCEPTION is far too generic; I can't identify what went wrong. The brokers were definitely not shut down. – Anil Kumar Aug 18 '17 at 07:06
  • Do the broker logs show anything at the same time? Is the error transient, or does it happen all the time? Are you connected on the plaintext port or SSL? – Hans Jespersen Aug 18 '17 at 14:24
  • I don't find anything abnormal in the logs at the same time. I'm connecting by giving the serverAddress property as serverIP:port. This is the first time we got this error. – Anil Kumar Aug 19 '17 at 14:24
  • Try TRACE level logging for more details. This is the KIP that added this feature starting in 0.9: https://issues.apache.org/jira/browse/KAFKA-2120. Is it possible you are having network outages? – Hans Jespersen Aug 19 '17 at 16:53
  • A network outage is possible, but we couldn't report it to our team because we don't have any logs. Thanks, will try TRACE level. – Anil Kumar Aug 21 '17 at 10:22
  • Hey, how did you solve this? Would appreciate the help. – Chitresh Sinha Nov 24 '21 at 18:43
  • I got this error on localhost with versions 2.6.3 and 3.0.0 under load. The error required restarting the client. – Peter Lawrey Jan 12 '22 at 10:27

A quick code dive shows the most probable cause: a lost connection to the upstream broker, which causes delivery to fail internally inside the Sender (link). You might want to enable TRACE logging for Sender to confirm that:

    if (response.wasDisconnected()) {
        log.trace("Cancelled request with header {} due to node {} being disconnected",
            requestHeader, response.destination());
        for (ProducerBatch batch : batches.values())
            completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, String.format("Disconnected from node %s", response.destination())),
                    correlationId, now);
    }

With the batch completed unsuccessfully, it gets retried. From the log you attached, it looks like this retry used up your last attempt (0 attempts left), so any further failure propagates to your level (link):

        if (canRetry(batch, response, now)) {
            log.warn(
                "Got error produce response with correlation id {} on topic-partition {}, retrying ({} attempts left). Error: {}",
                ....
            reenqueueBatch(batch, now);
        }
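
To help read the "0 attempts left" figure in that warning, here is a minimal sketch of a retry budget. It assumes the counter is the configured retries, minus the retries already used, minus the one being scheduled. The names `RetryBudgetSketch`, `canRetry` and `attemptsLeft` are hypothetical, and this is a model for illustration, not Kafka's actual code.

```java
// Hypothetical model of a producer retry budget; not Kafka's real implementation.
final class RetryBudgetSketch {

    // A batch may be retried while fewer retries have been used than configured.
    static boolean canRetry(int configuredRetries, int retriesUsed) {
        return retriesUsed < configuredRetries;
    }

    // Retries remaining after the one about to be scheduled (assumed formula).
    static int attemptsLeft(int configuredRetries, int retriesUsed) {
        return configuredRetries - retriesUsed - 1;
    }

    public static void main(String[] args) {
        int retries = 1; // same value as "retries" in the question
        for (int used = 0; used <= retries; used++) {
            if (canRetry(retries, used)) {
                System.out.println("retrying (" + attemptsLeft(retries, used) + " attempts left)");
            } else {
                System.out.println("no retries left, error is returned to the caller");
            }
        }
    }
}
```

Under that assumption, with retries set to 1 the very first failure already logs 0 attempts left, and a second failure of the same batch is reported to the send callback or future.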

So the ideas are:

  • investigate your network connectivity. Unfortunately, this might mean enabling trace logging at least on the client side (especially for NetworkClient, which does all the upstream broker connection management) to see if there is any connection loss;
  • increase the producer's retries value (newer versions of Kafka already default it to Integer.MAX_VALUE).
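
The second idea could look roughly like the sketch below. It only builds the `Properties` object, so it needs nothing beyond the JDK. The config keys are standard producer settings, but the class name `ProducerRetryConfig` and the concrete values (5 retries, 500 ms backoff, 30 s request timeout) are assumptions chosen for illustration, not recommendations for your cluster.

```java
import java.util.Properties;

// Hypothetical helper that builds producer settings with a larger retry budget.
final class ProducerRetryConfig {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all");
        // Illustrative values: allow several retries and wait between them,
        // so a short network blip does not exhaust the budget immediately.
        props.put("retries", "5");
        props.put("retry.backoff.ms", "500");
        props.put("request.timeout.ms", "30000");
        // With retries enabled, one in-flight request per connection keeps
        // message order intact if a batch has to be re-sent.
        props.put("max.in.flight.requests.per.connection", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }
}
```

You would then pass `ProducerRetryConfig.build(serverAdress)` to the `KafkaProducer` constructor in place of the original `props`. Limiting in-flight requests trades some throughput for ordering, which may or may not matter for your single-partition topic.
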
Adam Kotwasinski