
I'm fairly new to working with Kafka and MSK in AWS. I'm using kafkajs to write from a lambda to an MSK cluster. My records are being written successfully to my Kafka cluster, but my client is also logging connection timeout errors into CloudWatch. I'm curious if I could be doing something different in my code to avoid having error logs.

This is my producer code:

const { Kafka } = require("kafkajs");

const client = new Kafka({
    clientId: "client-id", 
    brokers: ["broker1:9092", "broker2:9092"],  // example brokers used here
});

const producer = client.producer({
    idempotent: true
});

const record = {
    topic: "topic1",
    messages: [
        { value: JSON.stringify("message") }
    ]
};

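// Connect, send, then disconnect. `throw` is a statement, so the arrow
// function in .catch() needs a braced body to rethrow.
await producer
    .connect()
    .then(() => producer.send(record))
    .then(() => producer.disconnect())
    .catch(err => { throw new Error(err.message || JSON.stringify(err)); });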

And here is an example of the error output:

{
    "level": "ERROR",
    "timestamp": "2022-12-05T20:44:06.637Z",
    "logger": "kafkajs",
    "message": "[Connection] Connection timeout",
    "broker": "[some-broker]:9092",
    "clientId": "[some-client-id]"
}

I'm not sure if I just need to increase my connection timeout in the client or if I'm missing something in the initialization. Like I said, the record still makes it into the cluster, but I'd like to clean up the logs so I don't see this error so often. Has anyone had this issue and solved it? Or is this a normal thing to see when working with MSK and kafkajs?

RusskiT
  • You could parse the error and silence certain events, if you really wanted to. Or you can add additional properties to the client definition to increase timeouts – OneCricketeer Dec 06 '22 at 15:27
  • It would be worth checking how long your Lambda function takes to complete on average. The KafkaJS client has some defaults worth keeping in mind: `acks` defaults to `-1`, which means all in-sync replicas must acknowledge the write, and `timeout` defaults to 30 seconds. Since the producer's reply arrives asynchronously, the messages can be written to the partitions while the reply never makes it back because the socket connection timed out. – Ricardo Ferreira Dec 07 '22 at 21:36
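To make the two suggestions in the comments concrete, here is a rough sketch of what they could look like. `connectionTimeout`, `requestTimeout`, and `logCreator` are KafkaJS client options; the helper name `createQuietTimeoutLogCreator`, the 3000 ms value, and the choice to downgrade the entry to WARN instead of dropping it are my own assumptions, not something from the question. Note that quieting the log only hides the symptom, so it is worth ruling out a real connectivity problem first.

```javascript
// Hypothetical helper (not part of KafkaJS): builds a logCreator that
// reports "Connection timeout" errors as WARN instead of ERROR.
// KafkaJS calls logCreator(logLevel) once and then calls the returned
// function with { namespace, level, label, log } for every log entry.
function createQuietTimeoutLogCreator(write = console.log) {
  return () => ({ namespace, label, log }) => {
    const { message, ...extra } = log;
    const isConnectionTimeout =
      label === "ERROR" &&
      typeof message === "string" &&
      message.includes("Connection timeout");

    write(JSON.stringify({
      level: isConnectionTimeout ? "WARN" : label,
      ...extra,
      // Prefix the namespace the same way the log in the question shows it.
      message: namespace ? `[${namespace}] ${message}` : message,
    }));
  };
}

// Client settings with a longer connection timeout.
// The values here are assumptions; tune them for your network path.
const clientConfig = {
  clientId: "client-id",
  brokers: ["broker1:9092", "broker2:9092"], // example brokers
  connectionTimeout: 3000, // ms, KafkaJS default is 1000
  requestTimeout: 30000,   // ms, KafkaJS default is 30000
  logCreator: createQuietTimeoutLogCreator(),
};

// Usage (requires the kafkajs package):
// const { Kafka } = require("kafkajs");
// const client = new Kafka(clientConfig);
```

Downgrading instead of silencing keeps the entry searchable in CloudWatch, which helps if the timeouts turn out to point at a single broker.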

1 Answer


This isn't an exciting answer, but it turns out that our Transit Gateway wasn't configured to allow traffic from the VPC to one of the brokers. Moral of the story: always check the network configuration for every broker endpoint.

I was sending data from one account to another using kafkajs in a Lambda function through a Transit Gateway. The code itself worked as intended, but the Transit Gateway configuration didn't allow traffic from the sending account to reach one of the brokers in the MSK cluster.
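If you want to check for this kind of problem from inside the Lambda function, a plain TCP probe against each broker will show which endpoint is unreachable. This is only a sketch using Node's built-in `net` module; `parseBroker`, `checkBroker`, and `checkBrokers` are hypothetical helper names and the 3000 ms timeout is an assumption. A successful connect only proves the network path is open, not that TLS or authentication would succeed.

```javascript
const net = require("net");

// Split "host:port" into its parts (hypothetical helper).
function parseBroker(broker) {
  const idx = broker.lastIndexOf(":");
  return { host: broker.slice(0, idx), port: Number(broker.slice(idx + 1)) };
}

// Try to open a TCP connection to one broker and report the outcome.
// Always resolves, never rejects, so one bad broker can't hide the others.
function checkBroker(broker, timeoutMs = 3000) {
  const { host, port } = parseBroker(broker);
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (reachable, reason) => {
      socket.destroy();
      resolve({ broker, reachable, reason });
    };
    socket.setTimeout(timeoutMs);
    socket.once("connect", () => finish(true, "connected"));
    socket.once("timeout", () => finish(false, "timeout"));
    socket.once("error", (err) => finish(false, err.code || err.message));
  });
}

// Probe every broker in parallel.
function checkBrokers(brokers, timeoutMs) {
  return Promise.all(brokers.map((b) => checkBroker(b, timeoutMs)));
}

// Usage, e.g. at the top of the Lambda handler while debugging:
// const results = await checkBrokers(["broker1:9092", "broker2:9092"]);
// console.log(JSON.stringify(results));
```

A broker that reports `timeout` while the others report `connected` would point at routing or security group rules for that one endpoint, not at the producer code.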

RusskiT