involuntary disruptions / SIGKILL handling in microservice following saga pattern

Question

Should I engineer my microservice to handle involuntary disruptions such as hardware failure? Are these disruptions frequent enough to be worth handling in a service running on an AWS-managed EKS cluster?
Should I change the service's design to handle an unexpected SIGKILL, for example by persisting the data at each step, or would that be considered over-engineering?

What standard approach would you suggest for handling these involuntary disruptions if the service is:
a) a RESTful service that typically responds in 1s (follows the saga pattern), or
b) a service that processes a large 1GB file in 1 hour?

score 0 · Answer 1 · answered Jan 05 '22 at 13:02

There are a couple of ways to handle those disruptions. As mentioned here:

Here are some ways to mitigate involuntary disruptions:

Ensure your pod requests the resources it needs.

Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)

For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster.)

The frequency of voluntary disruptions varies.

So:

if your budget allows it, spread your app across zones or racks; you can use Node affinity to schedule Pods on certain nodes,
make sure to configure Replicas, which ensures that when one Pod receives SIGKILL the load is automatically directed to another Pod. You can read more about this here.
consider using DaemonSets, which ensure each Node runs a copy of a Pod.
use Deployments for stateless apps and StatefulSets for stateful ones.
the last thing you can do is to write your app to be disruption tolerant.
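
For the last point, and for the 1-hour file job in particular, one common way to make the app disruption tolerant is to persist a checkpoint after each chunk of work, so a restarted Pod resumes instead of starting over. Below is a minimal sketch of that idea, not a definitive implementation. The names process_file, load_offset, save_offset and handle_chunk, the JSON checkpoint format and the chunk size are all hypothetical. It also assumes the checkpoint file lives on storage that survives the Pod (for example a persistent volume), since a Pod's local filesystem is lost when the node fails.

```python
import json
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size (1 MiB); tune for your workload


def load_offset(checkpoint_path):
    """Return the last committed byte offset, or 0 if no checkpoint exists yet."""
    try:
        with open(checkpoint_path) as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0


def save_offset(checkpoint_path, offset):
    """Persist the offset atomically: write a temp file, fsync it, then rename.

    A SIGKILL in the middle leaves either the old or the new checkpoint,
    never a half-written one.
    """
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, checkpoint_path)


def process_file(input_path, checkpoint_path, handle_chunk, chunk_size=CHUNK_SIZE):
    """Process input_path chunk by chunk, resuming from the last checkpoint.

    handle_chunk(offset, chunk) is the caller's (hypothetical) work function.
    It should be idempotent: if the process dies after handle_chunk returns
    but before the checkpoint is saved, that chunk is handled again on restart.
    """
    offset = load_offset(checkpoint_path)
    with open(input_path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            handle_chunk(offset, chunk)
            offset += len(chunk)
            save_offset(checkpoint_path, offset)
    return offset
```

The same idea carries over to a saga step: record that a step completed before moving on, and make each step safe to repeat. Whether this is over-engineering mostly depends on how expensive it is to redo the work; for a 1s request a client retry is often enough, while losing an hour of processing usually justifies a checkpoint.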

I hope this clears things up a little; feel free to ask more questions.

These points are also in the link mentioned in the question. They cover disruption handling for an application running with multiple replicas to keep it available. But the question is about the one replica that was affected by the hardware failure. — new__1, Jan 06 '22 at 11:44
You asked about design changes; I offered some you may consider on the Kubernetes side. If you want to run a single replica, it's up to AWS or your code to handle those disruptions. — mdobrucki, Jan 10 '22 at 12:00
