I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum but recovery rate is quick (avoid reprocessing message because it is expensive).
I realized that if there was to be some kind of failure, like my microservice would crash, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to the file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, only after writing to the file is successful.
But then, upon further thinking, I realized that if there was to be a failure on the file system, I might not find my files e.g. using a cloud file service might still have a chance of failure even if the marketed rate is that of >99% availability. I might end up in an inconsistent state where I have data in my Kafka topic (which is unaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?