
I have an Azure web role project with a long startup task that installs third-party software on the instance. Occasionally I've seen instances that don't respond, so I'm implementing a probe so that the load balancer notices and stops directing traffic to bad instances. This of course isn't enough: what I want is for Azure (the Fabric?) to then reboot the instance and, if that doesn't help (that is, the instance still doesn't reply properly to the probe), reimage it. Is that the behavior, and if so, where is it documented? I searched for quite a while but didn't find anything useful.

Thanks

2 Answers


Using the management API, you should be able to monitor your role instances externally. Then, if one is taking too long, you should be able to force it to be re-imaged.
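A rough sketch of such an external monitor, written in Python purely for illustration. Everything named here is hypothetical: `InstanceMonitor`, the two thresholds, and the `probe`, `reboot`, and `reimage` callables, which you would have to implement yourself (for example as wrappers around the management API's reboot and reimage operations for a role instance). It only shows the escalation logic: reboot after a few consecutive probe failures, and reimage if the failures continue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InstanceMonitor:
    """Escalate from reboot to reimage while an instance keeps failing its probe."""

    # All three callables are assumptions: supply your own implementations.
    probe: Callable[[str], bool]    # True if the instance answers properly
    reboot: Callable[[str], None]   # e.g. a wrapper around a management API reboot call
    reimage: Callable[[str], None]  # e.g. a wrapper around a management API reimage call
    failures_before_reboot: int = 3   # hypothetical threshold
    failures_before_reimage: int = 6  # hypothetical threshold
    _failures: Dict[str, int] = field(default_factory=dict)

    def check(self, instance: str) -> str:
        """Probe one instance once and return the action that was taken."""
        if self.probe(instance):
            # A good answer clears the failure history.
            self._failures[instance] = 0
            return "healthy"

        count = self._failures.get(instance, 0) + 1
        self._failures[instance] = count

        if count >= self.failures_before_reimage:
            # The reboot did not help: reimage and start counting again.
            self.reimage(instance)
            self._failures[instance] = 0
            return "reimaged"
        if count == self.failures_before_reboot:
            self.reboot(instance)
            return "rebooted"
        return "failing"
```

You would call `check` for each instance on a timer. The gap between the two thresholds has to be long enough for a rebooted instance to finish its startup task, otherwise a slow but healthy instance gets reimaged needlessly.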

BrentDaCodeMonkey

http://blogs.msdn.com/b/kwill/archive/2013/02/28/heartbeats-recovery-and-the-load-balancer.aspx describes the health of a role instance, what Azure does for recovery, and how to use a load balancer probe.

When you say that your instance doesn't respond, does that mean that the instance shows as Busy (or something besides Ready) in the portal, or just that IIS isn't responding to requests? If the former (instance showing Busy) then you don't need a load balancer probe since Azure will automatically remove that instance from rotation. If the latter (IIS not responding) then you can potentially implement a StatusCheck event in your web code such that if w3wp itself is having a problem then the instance will be taken out of rotation by the fabric, but if w3wp itself is healthy and it is just the requests that are not responding then you will need the load balancer probe.
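To make the probe idea concrete, here is a minimal sketch of a probe endpoint, in Python for illustration only (a real web role would do the equivalent in its own web code). The names `probe_status` and `make_handler` and the list of check callables are hypothetical. It assumes, as I understand the load balancer's behavior, that only an HTTP 200 response counts as healthy and any other status or a timeout counts as a failure, so the sketch returns 200 only when every check passes and 503 otherwise.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Iterable


def probe_status(checks: Iterable[Callable[[], bool]]) -> int:
    """Return 200 only if every check passes; any failure or exception gives 503."""
    for check in checks:
        try:
            if not check():
                return 503
        except Exception:
            # A check that blows up is treated the same as a failed check.
            return 503
    return 200


def make_handler(checks: Iterable[Callable[[], bool]]):
    """Build a request handler class that answers GET requests with the probe result."""
    check_list = list(checks)

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status = probe_status(check_list)
            body = b"OK" if status == 200 else b"UNHEALTHY"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Keep frequent probe requests out of the log.
            pass

    return ProbeHandler
```

To try it locally, pass `make_handler([...])` to `http.server.HTTPServer` and call `serve_forever()`. The checks should exercise whatever actually breaks on your instances (for example, whether the third-party software answers), since a probe that only proves the web server is up would not catch requests that hang.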

Having a good monitoring and recovery solution in place is very valuable, but instead of rebooting instances to mitigate a w3wp problem, I would recommend investigating the root cause of why your instances aren't responding. Fix the source of the problem rather than apply a Band-Aid :). The blog post linked above, and in particular its troubleshooting scenario 5, may be a good place to start the investigation.

kwill
  • Thanks! The instances show as "Ready" on the Azure portal. When I RDP into them and browse locally, they return HTTP 400, and when I browse externally to the service's URL, I get "Oops! Google Chrome could not connect". So it appears that the probe's failure doesn't cause instances to be considered bad. – user2120679 Jun 29 '14 at 10:42
  • Do you think I should return a 500 error, which says the server is to blame and not the client? – user2120679 Jun 29 '14 at 10:42