Gracefully take a server out of Azure Load Balancer (drain stop)

Question

We have an application deployed to Azure IaaS VMs, served by IIS. To install updates, we need to take each machine out of the load balancer, one by one. Before moving to Azure, we used Microsoft NLB, which can DRAIN STOP a node: it stops sending new connections to the node but keeps the existing connections open until they complete. How can we achieve the same with Azure LB?

score 14 · Accepted Answer · answered May 01 '15 at 15:35


The recommended way to do this is to have a custom health probe in your load balanced set. For example, you could have a simple healthcheck.html page on each of your VMs (in wwwroot, for example) and direct the probe from your load balanced set to this page. As long as the probe can retrieve that page (HTTP 200), the Azure load balancer will keep sending user requests to the VM.

When you need to update a VM, then you can simply rename the healthcheck.html to a different name such as _healthcheck.html. This will cause the probe to start receiving HTTP 404 errors and will take that machine out of the load balanced rotation because it is not getting HTTP 200. Existing connections will continue to be serviced but the Azure LB will stop sending new requests to the VM.

After your updates on the VM have been completed, rename _healthcheck.html back to healthcheck.html. The Azure LB probe will start getting HTTP 200 responses and as a result start sending requests to this VM again.

Repeat this for each VM in the load balanced set.
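
The rename step is easy to script so it runs the same way on every VM. A minimal Python sketch, assuming the script runs on the VM itself and is pointed at the site root; the function names are hypothetical and only the two file names come from the steps above:

```python
from pathlib import Path

# File names taken from the steps above.
ACTIVE_NAME = "healthcheck.html"
PARKED_NAME = "_healthcheck.html"


def take_out_of_rotation(site_root: Path) -> bool:
    """Rename healthcheck.html so the probe starts receiving HTTP 404.

    Returns True if the file was renamed, False if it was not there
    (for example because the node is already out of rotation).
    """
    active = site_root / ACTIVE_NAME
    if not active.exists():
        return False
    active.rename(site_root / PARKED_NAME)
    return True


def put_back_in_rotation(site_root: Path) -> bool:
    """Rename _healthcheck.html back so the probe receives HTTP 200 again."""
    parked = site_root / PARKED_NAME
    if not parked.exists():
        return False
    parked.rename(site_root / ACTIVE_NAME)
    return True
```

Call `take_out_of_rotation(Path(...))` with your own site root before patching, wait for the probe to fail its configured number of checks, and call `put_back_in_rotation` once the updates are done.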

answered May 01 '15 at 15:35

Rick Rainey


Thank you @Rick, I implemented this and it works as expected! – nulldotzero Nov 03 '15 at 09:32
Rick - I have never setup a "custom health probe". Would you say this is the best article to follow? http://blogs.msdn.com/b/piyushranjan/archive/2014/01/09/custom-probe-for-iaas-load-balanced-sets-in-windows-azure-and-acl-part-2.aspx – aron Dec 03 '15 at 04:18
It's going to depend on your application. Each application will have different dependencies on other resources (databases, caches, web services, etc). This is a very nice post to get you started though. – Rick Rainey Dec 03 '15 at 14:26
Is there a way to see whether this has worked as expected? I know how to setup an HTTP LB probe, but there appears to be no way to know whether the probe is currently in a failed or success state ... ? Seems important? – John Hargrove Mar 10 '16 at 00:03
For ARM deployed virtual machines you can view the health-probe logs. Details here: https://azure.microsoft.com/en-us/documentation/articles/load-balancer-monitor-log/ – Rick Rainey Mar 10 '16 at 13:14
Please note: This solution does NOT work for an http web server. Even if you have session persistence set to `Client IP and protocol` when you remove the `healthcheck.html` page from Node A then immediately your requests will be redirected to Node B. If you have an actual connection open - say downloading a large file then that will complete but this won't let you take an HTTP webserver node down for maintenance. – Simon Apr 25 '17 at 19:59
@JohnHargrove If you aren't able to write your own status page that pulls from the diagnostic logs I think you're best off creating a 'simple' status page that just pings your own probes and reports their availability. This won't help you much if you're trying to diagnose teething problems with an initial setup, but once you're up and running it can help verify the status. I'm absolutely astonished that they don't show the current status next to the probe. As far as I can tell when it's added to the dashboard the icon is always the same too :-( – Simon Apr 25 '17 at 20:05
correction: when I say `this won't let you take an HTTP webserver node down for maintenance` what I really meant is that existing sessions on the node you take down will go to another node. This may not be a problem for your architecture, but if you're using MS session state server stored on localhost you will lose those sessions and the user will get a session expired notification – Simon Apr 26 '17 at 18:25
This solution is complicated by the fact that the Azure LB doesn't seem to allow you to run more than one HTTP load-balancer rule on the same port, so if you want to run multiple probes (one to test that your homepage is up and running and another to test for healthcheck.html) then you can't do it, or at least I haven't been able to figure out how. My workaround is to create a /monitor/healthcheck MVC route which checks the homepage and the existence of the healthcheck file, and then point the Azure LB to this endpoint instead. Not ideal. – user72964 Oct 17 '17 at 11:12
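
The workaround in the comment above boils down to one endpoint that reports healthy only when both conditions hold. A rough Python sketch of that decision logic, not the commenter's MVC code; `combined_probe_status` and `homepage_check` are hypothetical names, and returning 503 for the failing case is an assumption (the probe only treats HTTP 200 as healthy):

```python
from pathlib import Path
from typing import Callable


def combined_probe_status(marker_file: Path,
                          homepage_check: Callable[[], bool]) -> int:
    """Return the HTTP status a single combined probe endpoint would send.

    marker_file    -- the healthcheck file an operator renames to drain the node
    homepage_check -- hypothetical callable that returns True when the
                      homepage renders correctly
    """
    if not marker_file.exists():
        # Operator removed or renamed the marker: report unhealthy on purpose.
        return 503
    try:
        healthy = homepage_check()
    except Exception:
        # A crashing check counts as unhealthy rather than breaking the probe.
        healthy = False
    return 200 if healthy else 503
```

Checking the marker first means a drained node answers the probe without touching the application at all.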

In our experience, the health probe method does NOT work as expected. We have 3 IIS web servers behind an LB. We make a change in the DB which causes our health check page to return 503s. We can see in Diagnostic Logs that the load balancer registered the change. Yet we can prove without a doubt that requests continue to arrive at the "disabled" web server. IIS's Worker Processes view shows this, as well as IIS logging. The health probe Intervals are set to 15s with an Unhealthy Threshold of 2. I've sat for 10m waiting for requests to stop coming in. I can't explain it. – Larry Silverman Jan 12 '18 at 15:37
I'll also add that Diagnostics output is the only place we've been able to determine whether the load balancer has witnessed a change in health probe. We've noted that logging to Log Analytics/OMS and also to blob storage can take up to 10 minutes. We haven't tried Event Hub yet. – Larry Silverman Jan 12 '18 at 19:19
@LarrySilverman you can check the IIS server logs and see the HTTP code returned to the load balancer. IF requests continue to arrive at the server it means that someone is not passing through the load balancer and is going to the server directly. – nulldotzero May 31 '18 at 17:13
@nulldotzero It's not possible for clients to bypass the load balancer to access the server directly. The requests are coming through the load balancer, without a doubt. – Larry Silverman Jun 01 '18 at 13:59

@Larry Silverman Is this an existing TCP connection or a new one? Remember that the SLB doesn't close existing TCP connections, so a client that was connected before will still be routed to the now-unhealthy instance. The best way to repro this is to run while($true) {iwr ; start-sleep -seconds 10} in PowerShell from an outside machine, then go and mark the instance down. You will notice that the loop starts to fail and the logs show the requests reaching the down instance, because the session persisted. Repeat the same iwr with -DisableKeepAlive. – Anirudh Goel Oct 18 '18 at 05:57
Then you would see that new connections are not being made to the down instance.. – Anirudh Goel Oct 18 '18 at 05:58
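
The same comparison can be made from any outside machine with Python's standard library. A minimal sketch, assuming plain HTTP and that the host and port you pass are your load balancer's front end; the function names are hypothetical:

```python
import http.client


def fetch_on_reused_connection(conn: http.client.HTTPConnection,
                               path: str = "/") -> int:
    """Send a GET on an already-open keep-alive connection.

    Repeated calls with the same conn stay on the same TCP connection,
    so they keep landing on whichever node accepted it first.
    """
    conn.request("GET", path)
    resp = conn.getresponse()
    resp.read()  # consume the body so the socket can be reused
    return resp.status


def fetch_on_new_connection(host: str, port: int, path: str = "/") -> int:
    """Open a fresh TCP connection for one GET, then close it.

    Every call is a new connection, so the load balancer makes a new
    routing decision each time.
    """
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path, headers={"Connection": "close"})
        resp = conn.getresponse()
        resp.read()
        return resp.status
    finally:
        conn.close()
```

Create one `http.client.HTTPConnection(host, port)`, call `fetch_on_reused_connection` in a loop, and mark the instance down; then run the same loop with `fetch_on_new_connection` and compare which node's logs record the requests.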
@AnirudhGoel That's insightful. So keepalive connections stay alive and HTTP requests will keep coming from already-connected clients. So how DO you drain stop an IIS node given this architecture? Is stopping the app pool "graceful"? – Larry Silverman Oct 18 '18 at 18:37

I stumbled upon this while searching for an answer to that very question, so High Five! It seems the best option would be for the application code to detect this shutdown event and start closing TCP connections by adding a "Connection: close" header, which is the equivalent of draining TCP connections. I am also exploring this option https://serverfault.com/a/284199/83705 and will keep you posted on how it looks. – Anirudh Goel Oct 18 '18 at 18:40
It would have been ideal if the slb itself gave an option to close existing tcp connections gracefully, I don't know why it is not there. – Anirudh Goel Oct 18 '18 at 18:42
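
As a sketch of the "Connection: close" idea, here is a small Python standard-library server standing in for the real application (it is not IIS configuration). The `DRAINING` flag and the handler name are hypothetical; how your application learns that maintenance is starting is left open:

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical process-wide flag: set it when maintenance is about to start.
DRAINING = threading.Event()


class DrainAwareHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # connections are persistent by default

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        if DRAINING.is_set():
            # Tell the client this is the last response on this connection.
            # http.server also closes the socket after sending this header.
            self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port: int) -> ThreadingHTTPServer:
    """Build a server on localhost; the caller decides when to serve."""
    return ThreadingHTTPServer(("127.0.0.1", port), DrainAwareHandler)
```

Once `DRAINING.set()` has been called, each keep-alive client gets one more response and then has to reconnect, and with the probe already failing the new connection should go to another node.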

score 4 · Answer 2 · answered Aug 09 '18 at 21:38


In their documentation, Microsoft recommends using a Security Group to explicitly block the health probe. All Azure Load Balancer health probes will come from 168.63.129.16.

An example would be using an incoming NSG rule to deny 168.63.129.16 to destination of the VM NIC that you want to remove from the pool.
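
For illustration, this is roughly what such a rule looks like as the JSON body of an ARM security rule, built in Python. The rule name, the priority, and the per-VM destination address are assumptions; the property names follow the ARM securityRules schema as I understand it, so check them against the current API reference before use:

```python
import json

# Source address of Azure Load Balancer health probes (from the answer above).
AZURE_PROBE_SOURCE = "168.63.129.16"


def deny_probe_rule(vm_private_ip: str, priority: int = 100) -> dict:
    """Build an inbound deny rule that hides one VM from the health probe.

    vm_private_ip -- private IP of the NIC to take out of the pool
    priority      -- assumed value; it must be a lower number than any
                     rule that allows the probe traffic
    """
    return {
        "name": "DenyAzureLBProbe",  # hypothetical rule name
        "properties": {
            "priority": priority,
            "direction": "Inbound",
            "access": "Deny",
            "protocol": "*",
            "sourceAddressPrefix": AZURE_PROBE_SOURCE,
            "sourcePortRange": "*",
            "destinationAddressPrefix": vm_private_ip,
            "destinationPortRange": "*",
        },
    }


def rule_as_json(vm_private_ip: str) -> str:
    """Serialize the rule for use in a template or REST call."""
    return json.dumps(deny_probe_rule(vm_private_ip), indent=2)
```

Removing the rule again after maintenance lets the probe succeed and puts the VM back into the pool. Narrowing `destinationPortRange` to the probe port is an option if you prefer a tighter rule.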

answered Aug 09 '18 at 21:38

Jeff Miles


We do this and it works quite well. – Justin Dec 09 '20 at 19:51
Do you have an idea how this approach would apply to an Application Gateway instead of a Load Balancer? We tried to do the same (with the required adaptations for an Application Gateway), but apparently there is a rule validation that prevents blocking the health probes for a specific VM: Failed to update security rule. Blocks incoming internet traffic on ports 65200 - 65535 to subnet. This is not permitted for Application Gateways that have V2 Sku. – Sergey Potapov Feb 05 '21 at 14:53
