Azure Compute Service worker becomes "busy" following scale-up

Question

I'm running one service in Azure with 4 worker instances. When I scale up to 5 worker instances, the first instance that had started goes into the "busy" state. Why is that? What happens during a scale-up? Does Azure re-run all the startup tasks? I'm very confused and can't find any documentation on this.

After scaling up to 5 instances the first instance changes its status to:

Busy (Waiting for role to start... Application startup tasks are running. [2014-08-12T18:36:52Z])

The Java process that was running there also stops. Why would this happen?

Any help would be appreciated.

Startup.cmd

REM   Log the startup date and time.
ECHO Startup.cmd: >> "%TEMP%\StartupLog.txt" 2>&1
ECHO Current date and time: >> "%TEMP%\StartupLog.txt" 2>&1
DATE /T >> "%TEMP%\StartupLog.txt" 2>&1
TIME /T >> "%TEMP%\StartupLog.txt" 2>&1

REM enable ICMP
netsh advfirewall firewall add rule name="ICMPv6 echo" dir=in action=allow enable=yes protocol=icmpv6:128,any

ECHO Starting WebService >> "%TEMP%\StartupLog.txt" 2>&1
tasklist /FI "IMAGENAME eq java.exe" 2>NUL | find /I /N "java.exe" >NUL 2>&1
if "%ERRORLEVEL%"=="0" GOTO running

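REM java.exe is not running yet, so launch the web service in the background.
START /B java -jar WEB-SERVICE-1_0--SNAPSHOT.jar app.properties >> "%TEMP%\StartupLog.txt" 2>&1

:running
REM Return exit code 0 so Azure treats the startup task as successful.
EXIT /B 0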

score 5 · Answer 1 · answered Aug 12 '14 at 23:09

During a scale operation Azure will send a RoleEnvironmentTopologyChange via the Changing event to all existing instances. This lets those instances discover the new role instance in order to allow communication between the instances. Note that this only happens if you have an internal endpoint defined (if you turn on RDP then you implicitly get an internal endpoint).

By default these topology changes won't affect running instances. However, if you subscribe to the Changing event and set e.Cancel = true, the role instance will recycle and run your startup tasks again.

For more information on the topology change see http://azure.microsoft.com/blog/2011/01/04/responding-to-role-topology-changes/.

So there are two issues here:

1. Why is your role not able to recover from a recycle? This is a significant issue and one you must fix in order to have a reliable service. You can start with the troubleshooting workflows at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx, and in particular Scenario 3 at http://blogs.msdn.com/b/kwill/archive/2013/09/06/troubleshooting-scenario-3-role-stuck-in-busy.aspx.
2. Why are you recycling your role instances in response to a topology change? Check your Changing event handler and make sure you aren't setting e.Cancel = true.
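
For the second point, here is a minimal sketch of a Changing handler that requests a recycle only for configuration setting changes and leaves topology changes alone. It assumes a .NET role that references the Azure SDK's Microsoft.WindowsAzure.ServiceRuntime assembly. The class name WorkerRole and the method name OnRoleEnvironmentChanging are hypothetical, and whether you want to recycle on setting changes at all is a design choice for your service.

```csharp
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

// Hypothetical role class, for illustration only.
public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Subscribe before the role starts so no change notification is missed.
        RoleEnvironment.Changing += OnRoleEnvironmentChanging;
        return base.OnStart();
    }

    private static void OnRoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        // Cancel (and therefore recycle) only when a configuration setting changed.
        // A RoleEnvironmentTopologyChange, such as a scale-up, leaves Cancel false,
        // so the running instance is not taken offline.
        e.Cancel = e.Changes.Any(change => change is RoleEnvironmentConfigurationSettingChange);
    }
}
```

If your handler looks like this and the instance still recycles on scale-up, the handler is probably not the cause and the WaHostBootstrapper logs are the next place to look.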

kwill

Also - make sure that you really are looking at one of the "first" instances. In the portal, the instance display order will change as the process unfolds. Expand the "NAME" column and you can see the instance order of creation (_x at end of name). Also, on the far right, look at the Update & Fault domains for more clarity. – viperguynaz Aug 12 '14 at 23:22
This is actually a Java app and I'm using the Azure plugin in Eclipse. I have a startup.cmd and a run.cmd. In the startup.cmd I am basically doing java.exe -jar app.jar with some logging. The plugin generates all the XML for me so I am not sure how to handle those extra RoleEnvironment events like "changing". All I currently have access to in terms of configuration is this: http://msdn.microsoft.com/en-us/library/azure/gg557552.aspx I am definitely not explicitly setting Cancel=true anywhere. In run.cmd I'm just passing java.exe to util/whileproc.cmd – bjoern Aug 13 '14 at 00:13
Hey Kevin, I really appreciate your time, and I realized that I had actually already read your articles on the topic. It would be fantastic if you could take a look at my startup script (I updated my question) and let me know if I am doing anything wrong there. Really appreciate the help. – bjoern Aug 14 '14 at 17:57
I don't see anything obviously wrong in your startup task. That scenario 3 blog post talks about reading the WaHostBootstrapper logs to determine where the role startup is stuck, so that is where I would start. – kwill Aug 14 '14 at 21:23
I have a similar issue with an ASP.NET web role. I don't have an event handler either, but even if I did, there is a role stop followed by Guest Agent initialization, which takes about 10 minutes before it reaches role start. So for 10 minutes none of my code is running while the environment restarts. @kwill, is this by design? Should I open a new question? – Piedone Jan 29 '17 at 16:32

score 0 · Answer 2 · answered Jan 29 '17 at 17:50

This is too long for a comment; I'm just adding to what kwill has already said.

My ASP.NET Web Role didn't have e.Cancel = true anywhere but still got restarted after a scale-out. More precisely, it was recycled: the environment was completely re-initialized, just like after a fresh deployment, and it took about 10 minutes before OnStart() was even called. So I went ahead and added an event handler that just sets what is already the default:

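using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.Changing += (sender, e) =>
        {
            // Explicitly decline a recycle when the change is a topology change
            // (for example, an instance being added during scale-out).
            if (e.Changes.Any(change => change is RoleEnvironmentTopologyChange))
            {
                e.Cancel = false;
            }
        };

        // OnStart must return a bool; defer to the base implementation.
        return base.OnStart();
    }
}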

And this helped! The role still becomes busy, but only for a few seconds instead of 15-20 minutes. It seems that only the website in the role (or maybe the whole of IIS) restarts; the role itself doesn't restart, nor is the whole environment reinitialized.
