Managing guest executables dependencies - On premise Service Fabric

Question

We have recently decided to start using on-premise Service Fabric and have encountered a 'dependency' problem.

We have several guest executables with dependencies between them. A dependent executable can't recover from a restart of the service it depends on without being restarted itself.

An example to make it clear:

In the chart below, service B is dependent on service A. If service A encounters an unexpected error and gets restarted, service B will go into an 'error' state (which won't be reported to the fabric). This means service B will report an OK health state although it's in an error state.

We were thinking of a solution along these lines:

Run an independent service which monitors the health state events of all replicas/partitions/applications in the cluster and holds the entire dependency tree.

When the health state of a service changes, the monitor restarts the services that directly depend on it, which causes a domino effect of events -> restarts until the entire subtree has been reset (as shown in the Event -> Action flow chart below).
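To make the cascade concrete, here is a minimal sketch of how such a monitor could work out which services to restart, assuming the dependencies form a tree. The `DEPENDENTS` map and the `restart_order` function are hypothetical names used for illustration only; nothing here is a Service Fabric API.

```python
from collections import deque

# Hypothetical dependency map: each key lists the services that depend on it.
# In this example B depends on A, and C and D depend on B.
DEPENDENTS = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": [],
    "D": [],
}


def restart_order(failed, dependents):
    """Return the services to restart after `failed` was restarted.

    Walks the subtree below `failed` breadth-first, so direct dependents
    come first and their own dependents follow. Assumes a tree; a graph
    with shared dependents would need a topological sort instead.
    """
    order = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for dependent in dependents.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


if __name__ == "__main__":
    print(restart_order("A", DEPENDENTS))
```

Computing the whole order up front would let the monitor reset the subtree from a single event instead of waiting for each restart to produce its own health event.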

The problem is that healthReport events are not sent at short intervals (meaning my entire system could be down and I wouldn't know for a few minutes). I could poll the health state instead, but I need the history: even if the state is healthy now, that doesn't mean it wasn't in an error state earlier.

Another problem is that the events can surface at any level of a service (replica/partition), so I would have to aggregate all of them.

I would really appreciate any help on the matter. I am also completely open to any other suggestion for this problem, even if it goes in a completely different direction.
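A rough sketch of the aggregation I have in mind: events from any replica or partition are recorded against their parent service, and the monitor asks for the worst state seen since a given time rather than the current state. The `HealthHistory` class, its method names, and the state names are assumptions made for illustration, not Service Fabric types.

```python
from collections import defaultdict

# Severity ranking used to pick the worst state.
# The state names are illustrative, not an exact Service Fabric enum.
SEVERITY = {"Ok": 0, "Warning": 1, "Error": 2}


class HealthHistory:
    """Keeps every health event per service so past errors are not lost."""

    def __init__(self):
        # service name -> list of (timestamp, entity, state)
        self._events = defaultdict(list)

    def record(self, service, entity, timestamp, state):
        """Store an event raised by a replica or partition of `service`."""
        self._events[service].append((timestamp, entity, state))

    def worst_since(self, service, since):
        """Return the most severe state reported for `service` at or after `since`."""
        states = [
            state
            for timestamp, _entity, state in self._events.get(service, [])
            if timestamp >= since
        ]
        return max(states, key=SEVERITY.__getitem__, default="Ok")


if __name__ == "__main__":
    history = HealthHistory()
    history.record("A", "replica-1", 10, "Error")
    history.record("A", "replica-1", 20, "Ok")
    print(history.worst_since("A", since=0))
```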

When you are referring to "healthReport" are you talking about reporting replica state to Service Fabric? — Oleg Karasik, Mar 28 '19 at 06:53

score 1 · Answer 1 · answered Mar 31 '19 at 18:12

Cascading failures in services can generally be avoided by introducing fault tolerance at the communication boundaries between services. A few strategies to achieve this:

Introduce retries for failed operations with a delay in between. The delay between retries may grow exponentially. This is an easy option to implement if you are currently doing a lot of remote procedure call (RPC) style communication between services. It may be very effective if the services you depend on don't take too long to restart. Polly is a well-known library for implementing retries.
Use circuit breakers to cut off communication with failing services. In this metaphor, a closed circuit is formed between two services communicating normally. The circuit breaker monitors the communications. If it detects some number of failed communications, it 'opens' the circuit, causing any further communications to fail immediately. The circuit breaker then sends periodic queries to the failing service to check its health, and closes the circuit once the failing service becomes operational. This is a little more involved than retry policies since you are responsible for preventing an open circuit from crashing your service, and also for deciding what constitutes a healthy service. Polly also supports circuit breakers.
Use queues to form fully asynchronous communication between services. Instead of communicating directly from service B to A, queue outbound operations to A in service B. Process the queue in its own thread, and do not allow communication failures to escape the queue processor. You may also add an inbound queue to service A to receive messages from service B's outbound queue to completely isolate message processing from the network. This is probably the most durable option but also the most complex, as it requires a very different architecture from RPC, and you must also decide how to deal with messages which fail repeatedly. You might retry failed messages immediately, send them to the back of the queue after a delay, send them to a dead letter collection for manual processing, or drop the message altogether. Since you're using guest executables you don't have the luxury of reliable collections to help with this process, so a third-party solution like RabbitMQ might be useful if you decide to go this way.
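The first two strategies can be sketched in a few lines. This is a minimal illustration in plain Python, not Polly's API; `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpenError` are hypothetical names, and the thresholds are arbitrary. Instead of actively querying the failing service, this breaker lets one trial call through after a timeout, which is a common simplification.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

The caller in service B still has to decide what to do when `CircuitOpenError` is raised, for example returning a cached value or reporting an error health state, so that an open circuit does not crash the service.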

How large is the API surface of your guest executables? Could you potentially wrap them by writing another service which routes calls to the guest executable? You could inject the fault tolerance there. — abarger, Apr 02 '19 at 16:56
That was one of the options, but then we get a triple wrapper which is kind of disgusting. — Rohi, Apr 06 '19 at 09:07
