4

I have been having difficulty finding the right configuration for scaling my cloud service effectively. I am assuming we just use the Scale section of the management portal and do nothing programmatically? My current configuration for the Web Role is:

  • Medium sized VM (4 GB RAM)
  • Autoscale - CPU
  • Instance Range - 1 to 10
  • Target CPU - 50 to 80
  • Scale up and down by 1 instance at a time
  • Scale up and down wait time - 5 mins

I used http://loader.io/ for load testing by sending concurrent requests to an API, and it could support only 50-100 users. Beyond that I was getting timeout (10 secs) errors. My app will be targeting millions of users, so I am not sure how I can scale efficiently to handle that much load on the server.

I think the problem could be the scale-up wait time of 5 mins (which I think is very high). In the management portal the lowest option is 5 mins, so I don't know how I can reduce it.

Any suggestions?

Bitsian

2 Answers

6

Azure's auto-scaling engine examines 60-minute CPU-utilization averages every 5 minutes. This means that every 5 minutes it has a chance to decide whether your CPU utilization is too high and scale you up.

If you need something more robust, I'd recommend thinking about the following:

  • CPU usage is rarely a good indicator for scaling websites. Look into Requests/sec or requests/current instead of CPU utilization.
  • Consider examining the need to scale more frequently (every 1 min?). The Azure portal cannot do this. You'll need either WASABi or AzureWatch for this.
  • Depending on your usage patterns, consider looking at shorter time averages to make a decision (i.e. average over 20 minutes, not 60 minutes). Once again, your choices here are WASABi or AzureWatch.
  • Consider looking at the /rate/ of increase in the metrics and not just the latest averages themselves, e.g. requests/sec rose by 20% in the last 20 minutes. Once again, the Azure autoscaling engine cannot do this; consider either WASABi (which may do this) or AzureWatch (which definitely can).
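
To make the last two points concrete, here is a minimal sketch of a rule that looks at a 20-minute average of requests/sec and at its rate of increase. It is not how WASABi or AzureWatch implement their rules; the function name `scale_decision` and the thresholds (`high`, `low`, `surge`) are hypothetical values picked for illustration only.

```python
from statistics import mean


def scale_decision(samples, window=20, high=200.0, low=50.0, surge=0.20):
    """Return +1 (scale up), -1 (scale down) or 0 (do nothing).

    samples: one requests/sec reading per minute, oldest first.
    window:  minutes to average over (20 instead of the portal's 60).
    high/low/surge: hypothetical thresholds, tune them for your own app.
    """
    if len(samples) < 2 * window:
        return 0  # not enough history to compare two windows

    recent = mean(samples[-window:])
    previous = mean(samples[-2 * window:-window])
    # Relative growth of the latest window over the one before it.
    growth = (recent - previous) / previous if previous > 0 else 0.0

    if recent > high or growth >= surge:
        return 1   # load is already high, or it is rising fast
    if recent < low and growth <= 0:
        return -1  # load is low and not rising
    return 0


# Requests/sec rose from 100 to 130 (30%) between two 20-minute windows.
print(scale_decision([100] * 20 + [130] * 20))
```

The growth check lets the rule fire while the absolute average is still well below the `high` threshold, which is the point of watching the rate rather than only the latest average.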

WASABi is an application block from Microsoft (i.e. a DLL) that you'll need to configure, host and monitor somewhere yourself. It is pretty flexible and, since it is open source, you can override any functionality you like.

AzureWatch is a third-party managed service that monitors/autoscales/heals your Azure roles/Virtual Machines/Websites/SQL Azure/etc. It costs money but you let someone else do all the dirty work.

I recently wrote a blog post comparing the three products.

Disclosure: I'm affiliated with AzureWatch

HTH

Igorek
  • A bit of clarification: You don't *need* WASABi or AzureWatch. There are other services such as MetricsHub. And you can even roll your own by examining perf counters, application logs, queue lengths, etc, and you can perform the scale actions with PowerShell calls. Having said all that: auto-scaling is complex and you're probably better off with a library or service as @Igorek mentioned. But there *are* other options. – David Makogon Aug 10 '13 at 03:57
  • Thank you, I have registered for a trial version of AzureWatch. I am still getting the hang of it, as in which performance counters to use... there are so many options. I did look at the requests/sec option in the web service section and its minimum aggregation period was 5 mins. How would it help me if there are 1000 requests within the space of 1 min and my web role doesn't scale for 5 mins? – Bitsian Aug 12 '13 at 15:19
  • Unfortunately, I cannot think of any solutions for super-sudden spikes that have no correlating leading indicators. You will need to wait at least 5 mins for X amount of servers to be brought up. However, if these spikes were related to some sort of marketing/promotion schedule, time of day, number of sign-ups, or some other leading indicator, then it would definitely be possible to accommodate a scale-ahead strategy. – Igorek Aug 12 '13 at 15:39
  • Sudden, unexpected spikes are hard. You can carry extra capacity to handle a decent-sized spike (say 30%-50% more, depending on your risk tolerance and cost aversion). Sure, you may not be able to withstand being hit by a massive load all at once, but by carrying some extra capacity you can reduce the impact while other servers are coming online. Also, look at the idea of reducing features if your site detects it has come under heavy load. Degrade the experience rather than let the site fail. – MikeWo Aug 13 '13 at 11:43
0

Another reason the minimum time is 5 minutes is that it takes Azure some time to assign additional machines to your Cloud Service and replicate your software onto them (Web Apps don't have that 'problem'). In my work as a SaaS admin I have found that for Cloud Services this ramp-up time after scaling can be around 3-5 minutes for our software package.

If you want to configure scaling within the Azure portal, then my suggestion would be to significantly lower your CPU ranges. As Igorek mentioned, Azure scaling looks at the average over the last 60 minutes. If a Cloud Service is running at 5% CPU most of the time and then suddenly peaks and runs at 99%, it will take some time for the average to go up and trigger your scale settings. Leaving it at 80% will cause scaling to happen far too late. Real-life example: I manage a portal that runs some CPU-intensive calculations. At normal usage our Cloud Services tend to run at 2-5% CPU, but on rare occasions we've seen it go up to 99% and stay there for a while.

My first scaling attempt was 2 instances, scaling up by 2 at 80% average CPU, but then it took around 40 minutes for the event to trigger because the average CPU did not go up that fast. Right now I have everything set to scale when average CPU goes over 25%, and what I see is that our services scale up after 10-12 minutes. I'm not saying 25% is the magic number; I'm saying keep in mind that you're working with an "average over 60 minutes".
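
To see why the threshold matters so much, here is a rough back-of-the-envelope sketch. It assumes an idealized step from a flat baseline straight to a flat peak, a 60-minute rolling average, and a check every 5 minutes that happens to line up with the start of the spike. The function `minutes_until_trigger` is made up for illustration; it is not an Azure API.

```python
def minutes_until_trigger(baseline, peak, threshold, window=60, check_every=5):
    """Minutes after the spike starts until a check sees the rolling
    average at or above the threshold, or None if it never gets there.

    Assumes CPU sat at `baseline` before the spike and stays at `peak`
    afterwards (an idealized step; real load is messier).
    """
    for t in range(check_every, window + 1, check_every):
        # t minutes of the window are at peak, the rest still at baseline.
        average = (t * peak + (window - t) * baseline) / window
        if average >= threshold:
            return t
    return None


print(minutes_until_trigger(5, 99, 80))  # 50
print(minutes_until_trigger(5, 99, 25))  # 15
```

With a 5% baseline and a 99% peak, this idealized model gives 50 minutes for an 80% threshold and 15 minutes for 25%. That is the same order of magnitude as the roughly 40 and 10-12 minutes I saw in practice; the exact numbers differ because real load is not a clean step.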

The second thing is that the Azure Portal only shows a limited set of scaling options; scaling can be configured in greater detail when you use PowerShell / REST. For example, the 60-minute interval over which the average is calculated can be lowered.

Krullthor