
I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result into a database. Only then do I acknowledge the message with Kafka.

I am required to keep data loss to an absolute minimum, but recovery must also be quick (i.e., I should avoid reprocessing messages, because that is expensive).

I realized that if there were some kind of failure, such as my microservice crashing, my messages would be reprocessed. So I thought of adding a kind of 'checkpoint' to my process by writing the state of the transformed message to a file and reading from it after a failure. This would mean that I could move my Kafka commit to an earlier stage, right after the write to the file succeeds.

But then, on further thought, I realized that if there were a failure on the file system, I might not find my files; e.g., a cloud file service still has a chance of failure even if the advertised availability is >99%. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.

So now, considering the above two design decisions, it feels like there is a tradeoff between not losing data and minimizing the time to recover from a failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoff? How do I reason about this situation? I thought that maybe the Saga pattern is appropriate here, but am I overcomplicating things?

  • Even sagas aren't 100% perfect or impervious to exceptions – OneCricketeer Jul 31 '22 at 00:22
  • Your question has nothing to do with the cloud. Kafka can also run on prem. Please don't roll back my edit... Also, your question seems not to mention Kafka transactions or idempotence, which consumers use for exactly-once delivery. – OneCricketeer Jul 31 '22 at 14:11

1 Answer


If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.

For example, in your consumer-worker reading loop (pseudocode):

while (...)
{
   messageAndOffset = getMsg();
   // do your things with the message
   saveOffsetInQueueToDB(messageAndOffset.offset);
}

saveOffsetInQueueToDB is responsible for adding the offset to a Queue/List, or whatever buffer you choose. This operation is only done once the message has been correctly processed.

Periodically, when a certain number of offsets are stored, or when a shutdown is detected, you could implement another function that stores the offsets for each topic/partition in:

  • An external database.
  • An external SLA-backed storage system, such as S3 or Azure Blob Storage.
  • Internal (disk) and remote loggers.

If you are concerned about failures, you could use a combination of two of those three options (or even use all three).

Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
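As a minimal sketch of that buffer-and-flush idea in Java: the OffsetBuffer class, the OffsetStore interface and the flush period below are hypothetical placeholders for whichever store(s) you picked above, not part of any Kafka API.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Buffer processed offsets in memory and flush them periodically
// (and on shutdown) to the external store(s).
public class OffsetBuffer {

    // Stand-in for your DB / S3 / logging writer(s).
    public interface OffsetStore {
        void persist(Map<Integer, Long> lastOffsetByPartition);
    }

    private final Map<Integer, Long> buffer = new ConcurrentHashMap<>();
    private final OffsetStore store;
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public OffsetBuffer(OffsetStore store, long flushPeriodSeconds) {
        this.store = store;
        // flush asynchronously, so the consumer loop never waits on the store
        flusher.scheduleAtFixedRate(this::flush,
                flushPeriodSeconds, flushPeriodSeconds, TimeUnit.SECONDS);
        // also flush when a shutdown is captured
        Runtime.getRuntime().addShutdownHook(new Thread(this::flush));
    }

    // called from the consumer loop, only after the message was fully processed
    public void saveOffsetInQueueToDB(int partition, long offset) {
        buffer.merge(partition, offset, (a, b) -> Math.max(a, b));
    }

    public synchronized void flush() {
        if (!buffer.isEmpty()) {
            store.persist(new HashMap<>(buffer));   // snapshot of the buffer
        }
    }
}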

If there's a crash, you could read all messages from the beginning (the easiest way is just to change the group.id and start from the beginning, e.g. with auto.offset.reset=earliest) while discarding those whose offsets are already in the database, avoiding the reprocessing. For example, by adding a condition in your loop (yep, pseudocode again):

while (...)
{
   messageAndOffset = getMsg();
   if (!offsetListFromDB.contains(messageAndOffset.offset))
   {
      // do your things with the message
      saveOffsetInQueueToDB(messageAndOffset.offset);
   }
}
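For reference, this is roughly what the "change the group.id and start from the beginning" part could look like with the plain Java consumer; the broker address, group name and String deserializers are placeholders, not something prescribed by the answer.

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RecoveryConsumerFactory {

    public static KafkaConsumer<String, String> buildRecoveryConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // a group.id the cluster has never seen => no committed offsets for it
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service-recovery-1");
        // with no committed offsets, start from the beginning of each partition
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}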

You could implement a more performant approach than this "not-included" check by just storing the last processed offset for each partition in a HashMap and then checking whether each incoming record's offset is bigger than the one stored for its partition. For example, say partition 0's last offset was 558 and partition 1's was 600:

// offsetMap = {0 -> 558, 1 -> 600}

while (...)
{
   messageAndOffset = getMsg();
   partition = messageAndOffset.partition;   // e.g. 0
   if (messageAndOffset.offset > offsetMap.get(partition))
   {
      // do your things with the message
      saveOffsetInQueueToDB(messageAndOffset.offset);
   }
}

This way, you guarantee that only the not-yet-processed messages from each partition will be processed.
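A slightly more concrete version of that loop with the plain Java consumer could look like the sketch below; it assumes the consumer is already subscribed, and offsetMap, process() and saveOffsetInQueueToDB() are hypothetical stand-ins for the map loaded from your external store and for your own processing/buffering code.

import java.time.Duration;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayFilter {

    // offsetMap holds the last processed offset per partition, loaded from
    // wherever the offsets were flushed (DB, S3, logs).
    public static void replay(KafkaConsumer<String, String> consumer,
                              Map<Integer, Long> offsetMap) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                long lastProcessed = offsetMap.getOrDefault(record.partition(), -1L);
                if (record.offset() > lastProcessed) {
                    process(record);                      // do your things
                    saveOffsetInQueueToDB(record.partition(), record.offset());
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* ... */ }

    private static void saveOffsetInQueueToDB(int partition, long offset) { /* ... */ }
}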


Regarding file system failures, that's why Kafka comes as a cluster: fault tolerance in Kafka is achieved by copying the partition data to other brokers, which are known as replicas.

So if you have 5 brokers and a replication factor of 5, for example, you would have to experience 5 different system failures at the same time (assuming the brokers are on separate hosts) in order to lose any data. Even 4 different brokers could fail at the same time without losing any data.

With that replication factor, all brokers hold the same data, the same partitions. If a filesystem error occurs on one of the brokers, the others will still hold all the information.
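If it helps, such a fully replicated topic could be created with the Java AdminClient roughly as follows; the broker address, topic name and partition count are made-up placeholders, the relevant part is the replication factor of 5.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 5: in a 5-broker cluster every
            // broker holds a copy of every partition of this topic
            NewTopic topic = new NewTopic("transformed-results", 6, (short) 5);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}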


  • Thanks. It is an interesting solution. I understand this would help data recovery, but what about being robust to failure? The file system point of failure still exists. – Zeruno Jul 31 '22 at 10:14
  • You're also introducing a failure point of the database client – OneCricketeer Jul 31 '22 at 14:10
  • Retry handling when inserting data into a database can easily be done, and as the offsets are also stored in memory before moving them to the DB, they could easily be maintained in memory, flushed both to logs on disk and to a remote logging server, in order to be stored in duplicate. Or even better, if using some SLA-backed service such as S3 (f.e.), the saving of offsets won't be affected by connectivity/failure issues – aran Jul 31 '22 at 14:58
  • S3, historically, has still had downtime. Local volumes can also fail. Or there can be external runtime exceptions that crash the app before a memory buffer can be cleared. And even still, someone could trip over the power cord in the data center and cause the app to forcibly terminate. There's no perfect scenario to 100% prevent data loss – OneCricketeer Jul 31 '22 at 15:44
  • Totally agreed with you here. The point is trying to avoid it, while taking into account that no one can ever guarantee there won't be a failure. Not even vendors, as the 99.9999% SLA tells... there's still a chance of failure. The point is trying to minimize the risk, and replication is a feasible possibility – aran Jul 31 '22 at 15:47
  • @Zeruno regarding file system failures, that's why Kafka comes as a cluster: Fault tolerance in Kafka is done by copying the partition data to other brokers which are known as replicas. So if you have 5 brokers, for example, you must experience a total of 5 different system failures (I guess brokers are in separate hosts) to lose any data. Even 4 different brokers could fail at the same time without losing any data. – aran Jul 31 '22 at 15:56
  • @aran I understand how Kafka protects against failures. My concern is about introducing the file system into the equation. If the kafka cluster fails, we have a problem. But if at least one of the kafka cluster or the file system cluster fails, we have a problem. Does this justify committing to the kafka cluster at a later stage? Then are we losing on recovery? Exactly, the idea is to minimize data loss. – Zeruno Jul 31 '22 at 16:10
  • But if just one broker fails, you won't have any problems. For example, for a cluster of 5 brokers, where every Kafka broker owns its own filesystem (don't dockerize the brokers), you would need a total of 5 different filesystem failures in order to lose data. If one fails, nothing happens. A replication factor of 5 means all 5 brokers will hold all the information. So, in summary, there's no need to commit the offsets later (commit them periodically, in batches, and outside of the consumer thread). The chance of total filesystem failure (for different nodes) is minimal, and out of our scope – aran Jul 31 '22 at 16:15
  • What if the kafka broker does not own its filesystem? We have two independent failure points. – Zeruno Jul 31 '22 at 16:19
  • Then there's no guarantee at all, and such a deployment shouldn't be in production. If each one has its own filesystem, replication makes sense; if not, it is just duplicating data in the same filesystem. – aran Jul 31 '22 at 16:21
  • I mean, what if the Kafka cluster's file system and the file system my application is writing a plain file to are independent clusters? This is the dilemma behind my question. Do I update the Kafka offset after writing to the file, or do I commit after my entire process is completed (writing to the database)? – Zeruno Jul 31 '22 at 16:26
  • Given the paradigm in the answer (committing the offset only after the message has been correctly processed), and if you don't lose any data thanks to replication, then you don't have to care about the application's filesystem: you could store the processed offset numbers on disk (the app's filesystem), in a remote database (which has another filesystem), in a remote logging server (with its own filesystem as well), and so on... so you avoid the single point of failure (by combining all or some of the methods to store offsets) and reduce the risk to a minimum. – aran Jul 31 '22 at 16:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246935/discussion-between-zeruno-and-aran). – Zeruno Jul 31 '22 at 16:33