Detecting crashes of Azure instances

Question

I want to detect the fact that an instance of my Azure role has crashed. Detection in my case means that another instance of my role is notified about the crash. Please review my idea explained below or propose another solution.

The idea I came up with takes advantage of the fact that a message dequeued from an Azure Queue is hidden from other consumers only for a limited time (its visibility timeout).

1. Configure an Azure Queue. All instances of the role listen to this queue.
2. Configure the role instances to have an internal endpoint.
3. When instance A starts, it posts a message to the queue. The message contains the ID of instance A, the IP of A's internal endpoint, and a marker indicating that this message should be forwarded back to A.
4. Most likely the message ends up on another instance, B. B forwards the MessageId and PopReceipt to A via the internal endpoint. Instance A creates a CloudQueueMessage object using this constructor: http://msdn.microsoft.com/en-us/library/dn451949.aspx
5. Instance A starts updating the visibility timeout of the received message indefinitely. From the Azure Queue's point of view, this message is being processed for a very long time. In the first update, A removes the "forward-this-message" marker.
6. If instance A crashes, it stops prolonging the processing. The message soon becomes visible to other instances automatically.
7. Instance C picks up the message and learns that A has crashed: the message contains the ID of instance A and no "forward-this-message" marker.
8. If instance A stops gracefully, it marks its queue message as processed.
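
The steps above can be sketched with a small in-memory model. This is only an illustration of the visibility-timeout idea, not the Azure SDK: FakeQueue, its methods, the manual clock, and the endpoint address are all assumptions made up for the example.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Message:
    body: dict
    message_id: int
    pop_receipt: int = 0
    visible_at: float = 0.0  # hidden from consumers while now < visible_at


class FakeQueue:
    """In-memory stand-in for a queue with visibility timeouts (hypothetical)."""

    def __init__(self):
        self.now = 0.0  # manual clock, advanced by the caller
        self._messages = {}
        self._ids = itertools.count(1)

    def put(self, body):
        msg = Message(body=dict(body), message_id=next(self._ids))
        self._messages[msg.message_id] = msg
        return msg.message_id

    def get(self, visibility_timeout):
        """Return (message_id, pop_receipt, body) of a visible message, or None."""
        for msg in self._messages.values():
            if msg.visible_at <= self.now:
                msg.visible_at = self.now + visibility_timeout
                msg.pop_receipt += 1
                return msg.message_id, msg.pop_receipt, dict(msg.body)
        return None

    def update(self, message_id, pop_receipt, visibility_timeout, body=None):
        """Extend the hidden period; only the holder of the current receipt may do so."""
        msg = self._messages[message_id]
        if msg.pop_receipt != pop_receipt:
            raise ValueError("stale pop receipt")
        msg.visible_at = self.now + visibility_timeout
        if body is not None:
            msg.body = dict(body)
        msg.pop_receipt += 1
        return msg.pop_receipt


def run_scenario(renewals_before_crash, timeout=30):
    queue = FakeQueue()
    # A announces itself; the endpoint value is a made-up placeholder.
    queue.put({"instance": "A", "endpoint": "10.0.0.4:8080", "forward": True})
    # B dequeues the message and hands id + receipt to A (direct hand-off here).
    message_id, receipt, body = queue.get(timeout)
    # A's first update clears the marker; later updates only extend the timeout.
    body["forward"] = False
    receipt = queue.update(message_id, receipt, timeout, body)
    for _ in range(renewals_before_crash):
        queue.now += timeout / 2
        receipt = queue.update(message_id, receipt, timeout)
        assert queue.get(timeout) is None  # nobody else can see the message
    # A crashes: no more updates, and the timeout runs out.
    queue.now += timeout + 1
    # C dequeues the message and interprets it.
    _, _, seen = queue.get(timeout)
    return seen["instance"] if not seen["forward"] else None
```

Calling run_scenario(3) returns "A": after three renewals and a simulated crash, the message reappears without the marker, which is how C would recognize it as a crash notice rather than a startup announcement.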

What happens when instance A fails to update visibility timeout in step #5? — Gaurav Mantri, Oct 22 '13 at 13:08
Instance A should start updating the visibility timeout in advance to have time for possible retries. This will mitigate transient problems. If it desperately fails to update it should probably commit suicide -- ask Azure to recycle the instance. — SergeyS, Oct 22 '13 at 14:02
Can you expose public and private endpoints on these instances to ping externally? — Igorek, Oct 22 '13 at 19:33
Theoretically it is possible to have instance input endpoint for external pings — SergeyS, Oct 23 '13 at 11:14
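
The comment above about renewing in advance and recycling on persistent failure could look roughly like the sketch below. The function names, the use of OSError as a stand-in for a transient storage error, and the request_recycle callback are all assumptions for illustration; none of them come from the Azure SDK.

```python
def renewal_interval(visibility_timeout, max_attempts, seconds_per_attempt):
    """Seconds to wait between renewals so that all retries fit before expiry."""
    margin = max_attempts * seconds_per_attempt
    if margin >= visibility_timeout:
        raise ValueError("retries do not fit inside the visibility timeout")
    return visibility_timeout - margin


def renew_or_recycle(renew, max_attempts, request_recycle):
    """Call renew() up to max_attempts times; recycle the instance if all fail.

    renew           -- hypothetical callable that extends the visibility timeout
    request_recycle -- hypothetical callable that asks the platform to restart us
    """
    for _ in range(max_attempts):
        try:
            renew()
            return True
        except OSError:  # stand-in for a transient storage or network error
            continue
    request_recycle()
    return False
```

With a 30-second timeout, 3 attempts, and 5 seconds budgeted per attempt, renewal_interval gives 15 seconds between renewals, leaving the remaining 15 seconds for retries before the message would become visible again.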

score 0 · Answer 1 · answered Oct 22 '13 at 19:39

This all seems very convoluted.

Personally, I would go back and look at the original assumption that I need to know when an instance crashes, and consider what I do with that information. I would favor an optimistic solution (i.e., assume success and handle failure) rather than the pessimistic solution (i.e., assume failure and so provide some mechanism to ensure success). One problem with the latter is that you are going to have to handle undeclared instance crashes anyway, so why not make that the default behavior? That is, invoke the operation on the instance and handle any failure that occurs.

For example, if I want to invoke an operation on an internal endpoint on another instance I would load balance against all the other instances and, on detecting a failed instance, try the operation on another instance. Ryan Dunn has what is now an ancient post on, among other things, load balancing against internal endpoints.
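
A minimal sketch of this "try another instance on failure" approach, assuming the caller already has a list of the other instances' endpoints. The function name, the operation callable, and the use of ConnectionError to signal a dead instance are illustrative assumptions, not part of any Azure API.

```python
import random


def invoke_with_failover(endpoints, operation, rng=random):
    """Try operation(endpoint) on the instances in random order.

    Returns (result, failed_endpoints) from the first instance that answers.
    Raises RuntimeError if every instance fails.
    """
    candidates = list(endpoints)
    rng.shuffle(candidates)  # crude load balancing across instances
    failed = []
    for endpoint in candidates:
        try:
            return operation(endpoint), failed
        except ConnectionError:
            # Treat the instance as down for this call and move on.
            failed.append(endpoint)
    raise RuntimeError(f"all {len(candidates)} instances failed")
```

The caller learns about a failed instance only as a side effect of a call that did not go through, so no separate crash-notification channel is needed for this use case.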

My basic point is that it is going to be hard to robustly perform this type of orchestration with messages being passed from one instance to another. There are just too many possible failure points. It would be better to come up with a solution that more directly addresses the underlying need. A simple solution is almost always preferable to a more complex solution.

Neil Mackenzie

I want to ensure reliable workflow execution: once accepted by an instance, the workflow must be completed regardless of possible crashes of the instance. One way to achieve that is to have an Azure Queue message for each workflow. It is a classic solution. But it is slow and costly. I can reduce the number of messages in the queue by tracking instances instead of individual workflows. – SergeyS Oct 23 '13 at 08:11
When instance C learns about crashed A, it will query the database to get the list of workflows that instance A was working on. Instance C takes over pieces of work from A by updating the database records. When all work is taken over, C marks the "A crashed" message as "done". If instance C crashes in turn, the "A crashed" message will automatically return to the queue and be picked up by another instance. – SergeyS Oct 23 '13 at 08:12
The Queue service works at up to 2,000 messages per second per queue so unless these workflows are sub-second I don't really see why they should impact behavior - particularly given the grief that not using a queue appears likely to cause. Have you considered using a single message per workflow and keeping track of an isolatable message status by updating the message? For example, the workflow may comprise 3 steps - and you just update the message when each step is completed. With a queue, you would of course have to handle idempotency. – Neil Mackenzie Oct 23 '13 at 15:49
Yes, a solution with queue message per workflow was considered and was ruled out as too slow. Our service has to do with real-time communication. Latency matters. – SergeyS Oct 23 '13 at 19:32
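
The takeover described in the comments above (C reassigning A's workflows in the database) can be sketched as follows. A plain dict stands in for the database table, and the function name and record layout are assumptions for illustration only.

```python
def take_over(workflows, crashed, successor):
    """Reassign every workflow still owned by `crashed` to `successor`.

    `workflows` maps workflow id -> owning instance id (stand-in for DB rows).
    The update is conditional on the current owner, so running it again
    after a partial or completed takeover changes nothing more.
    Returns the ids that were moved by this call.
    """
    moved = []
    for workflow_id, owner in list(workflows.items()):
        if owner == crashed:
            workflows[workflow_id] = successor
            moved.append(workflow_id)
    return moved
```

Because the update only touches rows still owned by the crashed instance, a second instance that picks up the same "A crashed" message after C fails partway would move only what is left. Workflows C had already claimed would then be covered by C's own crash message, assuming C registered one at startup.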
