We have recently decided to start using on-premises Service Fabric and have run into a 'dependency' problem.
We have several guest executables with dependencies between them. When a service they depend on is restarted, they cannot recover without being restarted themselves.
An example to make it clear:
In the chart below, service B is dependent on service A. If service A encounters an unexpected error and gets restarted, service B will go into an 'error' state that is not reported to Service Fabric. This means service B will report an OK health state even though it is actually in an error state.
We were thinking of a solution along these lines:
Run an independent service that monitors the health state events of all replicas, partitions, and applications in the cluster and holds the entire dependency tree.
When the health state of a service changes, the monitor restarts the services that directly depend on it, which causes a domino effect of events -> restarts until the entire subtree has been reset (as shown in the Event -> Action flow chart below).
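A minimal sketch of that cascade, assuming the dependency tree is held as a plain mapping from each service to the services that depend on it directly. The map contents beyond "B depends on A" and the function name `restart_order` are hypothetical, and the actual restart call is left out because it depends on how the guest executables are hosted:

```python
from collections import deque

# Hypothetical dependency map: service -> services that depend on it directly.
# Only "B depends on A" comes from the example above; C and D are made up.
DEPENDENTS = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": [],
    "D": [],
}


def restart_order(failed, dependents):
    """Return the services to restart, breadth-first, after `failed` was restarted."""
    order = []
    seen = {failed}          # guards against restarting a service twice
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(restart_order("A", DEPENDENTS))
```

Computing the whole order up front, rather than waiting for each restart to raise its own event, would avoid relying on one health event per hop of the tree.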
The problem is that the health report events are not sent at short intervals, meaning my entire system could be down and I wouldn't know for a few minutes. I could poll the health state instead, but I also need its history: even if the state is healthy now, that doesn't mean it wasn't in an error state earlier.
Another problem is that the events could be raised at any level of a service (replica or partition), so I would have to aggregate all of them.
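To illustrate the history requirement, here is a rough sketch of a recorder that keeps every sampled state with its timestamp, so a past error stays visible after the service reports healthy again. The class `HealthHistory` and its methods are hypothetical, and the states are plain strings rather than anything read from the Service Fabric API:

```python
import time


class HealthHistory:
    """Keeps timestamped health samples per service (in memory only)."""

    def __init__(self):
        self._samples = {}  # service name -> list of (timestamp, state)

    def record(self, service, state, timestamp=None):
        # Use the current time unless the caller supplies the sample time.
        ts = time.time() if timestamp is None else timestamp
        self._samples.setdefault(service, []).append((ts, state))

    def had_error_since(self, service, since):
        """True if any sample at or after `since` was in the "Error" state."""
        return any(
            ts >= since and state == "Error"
            for ts, state in self._samples.get(service, [])
        )


history = HealthHistory()
history.record("B", "Ok", timestamp=1.0)
history.record("B", "Error", timestamp=2.0)
history.record("B", "Ok", timestamp=3.0)
print(history.had_error_since("B", since=0.0))
```

A recorder like this can only see errors that fall on a sample, so the polling interval still bounds how short an outage can be and still get noticed.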
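The aggregation could be sketched as a "worst state wins" roll-up from replica and partition events to one state per service. This is a simplification and not how Service Fabric evaluates health policies; the event tuple shape and the names `SEVERITY` and `aggregate` are assumptions made for the example:

```python
# Assumed ordering of states, least to most severe.
SEVERITY = {"Ok": 0, "Warning": 1, "Error": 2}


def aggregate(events):
    """Roll up (service, entity_id, state) events to the worst state per service."""
    worst = {}
    for service, _entity_id, state in events:
        current = worst.get(service)
        if current is None or SEVERITY[state] > SEVERITY[current]:
            worst[service] = state
    return worst


events = [
    ("A", "partition-0/replica-1", "Ok"),
    ("A", "partition-0/replica-2", "Error"),
    ("B", "partition-0", "Warning"),
    ("B", "partition-1", "Ok"),
]
print(aggregate(events))
```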
I would really appreciate any help on the matter. I am also completely open to other suggestions for this problem, even if they go in a completely different direction.