ESX 3.5 Resource Groups

Question

I'm a DBA and manage a VMware ESX 3.5 cluster that predominantly hosts SQL Servers and a few application servers. I have a question about how to set up the resource groups, as I'm in conflict with one of the ESX system admins about how to manage the resources.

The cluster (3 nodes, 32GB per node) currently hosts 33 guests configured to consume 77GB of RAM, although ESX is reporting that only 44GB is active. The cluster hosts live, test, development servers and a few other miscellaneous guests.

What I'd like to do is simplify the management of the servers' resources, and be able to manage and report on the performance of related servers.

For example, the resources consumed (RAM, disk, CPU) by the live SQL servers, the SharePoint servers, the CRM servers, etc.

What I have done next is create 4 "top level" resource groups.

1-High    - For the most mission critical services (i.e. the live SQL server)
  32768 memory shares
2-Normal  - For the majority of the remaining live systems (CRM, SharePoint etc.)
  16384 memory shares
3-Dev     - Test and development systems
  8192 memory shares
4-Low     - Unsupported servers (no SLA, temporary build servers etc.)
  1024 memory shares

I have grouped the servers into their own "application" resource groups (SQL Live, SQL Test, CRM Live, CRM Test etc) but have not set any explicit resource limits on these groups.

And then I put the "application" groups into the appropriate "top level" resource group.

For example, each sub group has 4 guests, each with 1 CPU and 1GB RAM:

1-High               32768 shares
    SQL Live         4 guests
2-Normal             16384 shares
    CRM Live         4 guests
    SharePoint Live  4 guests
3-Dev                8192 shares
    CRM Test         4 guests
    SQL Test         4 guests
    SharePoint Test  4 guests
4-Low                1024 shares
    Remaining cruft  4 guests

The sysadmin chap is telling me that "Sharepoint will only get 28% of 50% of the resources it needs!"

Before I reply to him, can I get some advice and a check on my assumptions:

- In normal operation the cluster is not overcommitting RAM (or CPU), so no resource limits are being applied to any guest, for either CPU or RAM.
- If one of the hosts fails, there will only be 64GB of RAM available. As the remaining hosts restart the failed host's guests (we have HA and DRS enabled), the RAM will become overcommitted.
- I want to ensure that the highest priority services maintain their service.
- I don't want to micromanage each individual guest!

What are your thoughts and experiences?

Answer given to @Helvick as it is the correct explanation of Resource Group shares. But it does not answer the bullet points at the end of the question. Yes, placing 'SharePoint' into a group gives it fewer shares, but that's what I want it to do. I do not want 'SharePoint' (a limited use system) to have a disproportionate share of CPU/RAM compared to more important systems (SQL Server). SharePoint should 'lose' the same amount of resources as the other systems and not be a special case. Anyway - thanks for your help. — Guy, Jan 22 '10 at 22:40

Helvick · Accepted Answer · 2010-01-13T23:23:41.123

If I'm reading this correctly, you are right about the normal operation of your environment, but I'm not sure that either of you is right about how it works when contention arises.

When there is no contention (contention starts when resource utilization exceeds 80%, BTW), shares have no effect. So as far as normal operations in your environment are concerned, the Resource Groups will be cosmetic.

When there is contention, CPU resources will be constrained as your sysadmin has indicated, but that won't necessarily happen if you lose a host.

You don't say whether you have modified shares on the child resource pools. I'm going to assume these are all set to normal.

Assuming that there is contention, though, the way shares work is that each Resource Pool gets the proportion of the resources that is equal to its fraction of the total quantity of shares at that level. For your first level you have ~58k shares (58,368), so the High pool gets approx 56%, Normal gets 28%, Dev gets 14% and Low gets 1.7%. Within each pool the sub-pools share the resources of that pool equally unless you have explicitly set additional shares at that level. If you have, the same rules apply, but the total for the pool remains unaffected.

So in your case, when contention arises the Live SharePoint systems will get 50% of 28% of contended resources, i.e. 14%.
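The arithmetic above can be checked with a short sketch. It assumes the four top-level memory share values listed in the question (32768, 16384, 8192, 1024) and equal, default shares on the two child pools under 2-Normal, as the answer does. The function name is purely illustrative and is not part of any VMware tooling.

```python
def split_by_shares(shares):
    """Return each sibling's fraction of the contended resource."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Top-level memory shares as listed in the question.
top_level = {"1-High": 32768, "2-Normal": 16384, "3-Dev": 8192, "4-Low": 1024}
top = split_by_shares(top_level)

# Assumption: the child pools under 2-Normal carry equal (default) shares.
normal_children = split_by_shares({"CRM Live": 1, "SharePoint Live": 1})
sharepoint_live = top["2-Normal"] * normal_children["SharePoint Live"]

for name, fraction in top.items():
    print(f"{name:10s} {fraction:6.1%}")
print(f"SharePoint Live pool: {sharepoint_live:.1%}")
```

These fractions only describe how contended resources are divided; with no contention they have no effect.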

You can help things along somewhat by allocating reservations for the absolute minimum values of CPU and RAM that each system needs. The reserved values are guaranteed to the systems or resource pools you allocate them to and are not allocated by shares. The key drawback with them is that if the values are too high, the cluster may be unable to even attempt to restart the VMs, as the resources cannot be guaranteed.
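The drawback with reservations can be illustrated with a deliberately simplified check: reservations can only be guaranteed if their sum fits in the capacity that is left. The reservation values below are hypothetical (they are not from the question), the function name is made up for illustration, and real admission control also has to consider per-host fit and virtualization overhead, which this sketch ignores.

```python
def can_guarantee(reservations_gb, capacity_gb):
    """True if every memory reservation can be honoured at the same time."""
    return sum(reservations_gb.values()) <= capacity_gb


# Hypothetical per-pool memory reservations in GB (illustrative only).
reservations = {"SQL Live": 24, "CRM Live": 8, "SharePoint Live": 8}

two_hosts_left = can_guarantee(reservations, capacity_gb=64)
one_host_left = can_guarantee(reservations, capacity_gb=32)

print("Reservations fit on two 32GB hosts:", two_hosts_left)
print("Reservations fit on one 32GB host: ", one_host_left)
```

Keeping reservations to the bare minimum each system needs leaves the most room for restarts after a host failure.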

Also remember that even though your systems only consume ~44GB under normal operation, with Windows systems 100% of memory gets (briefly) allocated when a VM is started up. This can trigger a contention scenario for memory during a failover, even though there is actually enough RAM for the systems once they are running. It's something to keep an eye on more than worry too much about, but it can cause problems during HA restarts.
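As a rough illustration using the figures from the question (3 nodes with 32GB each, 77GB configured, 44GB active), the headroom with all hosts up and with one host down works out as follows. This is a back-of-the-envelope sketch that ignores hypervisor and per-VM memory overhead.

```python
RAM_PER_NODE_GB = 32
CONFIGURED_GB = 77  # total RAM configured across the guests
ACTIVE_GB = 44      # RAM that ESX reports as active


def headroom(nodes_up):
    """Spare RAM in GB against configured and active demand."""
    capacity = nodes_up * RAM_PER_NODE_GB
    return {
        "capacity": capacity,
        "vs_configured": capacity - CONFIGURED_GB,
        "vs_active": capacity - ACTIVE_GB,
    }


all_up = headroom(3)
one_down = headroom(2)

for label, result in (("3 hosts", all_up), ("2 hosts", one_down)):
    print(label, result)
```

A negative `vs_configured` value with a positive `vs_active` value is the situation described above: enough RAM for the steady state, but not if every guest touches all of its configured memory at once.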

Edited to add
If you've made no changes to the default share settings on individual VMs or child Resource Groups, then the proportion of resources allocated to individual VMs will not change when you move all VMs up a level in a structure where there is only a single child RG and place them directly in the parent. However, if there are multiple child RGs with different numbers of VMs in each, then this isn't true.

In your example, say we have your 4 SharePoint VMs in their child RG and 2 CRM VMs in their child RG. The SharePoint VMs get ~3.5% each (50% of 28% / 4) and the CRM VMs get ~7% each (50% of 28% / 2). If you now move all of them up to the parent RG and delete the empty child RGs, you have 6 VMs sharing the 28% of resources available to the Normal RG and each one will get ~4.7% (28% / 6).
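The nested versus flat comparison can be worked through in a few lines. This is a sketch under the same assumptions as the example: default, equal shares on the child RGs and on identically sized VMs, and the top-level share values from the question. The function names are illustrative only.

```python
# Fraction of contended resources that the Normal RG receives (about 28%).
NORMAL_FRACTION = 16384 / (32768 + 16384 + 8192 + 1024)


def per_vm_nested(pool_fraction, child_pools):
    """Equal split between child RGs, then equal split between VMs in each."""
    per_child = pool_fraction / len(child_pools)
    return {name: per_child / vm_count for name, vm_count in child_pools.items()}


def per_vm_flat(pool_fraction, vm_count):
    """All VMs sit directly in the parent RG with equal shares."""
    return pool_fraction / vm_count


# Uneven child RGs: 4 SharePoint VMs and 2 CRM VMs.
nested = per_vm_nested(NORMAL_FRACTION, {"SharePoint Live": 4, "CRM Live": 2})
flat = per_vm_flat(NORMAL_FRACTION, 6)

# Even child RGs: 4 VMs in each, as in the question.
even_nested = per_vm_nested(NORMAL_FRACTION, {"SharePoint Live": 4, "CRM Live": 4})
even_flat = per_vm_flat(NORMAL_FRACTION, 8)

for name, fraction in nested.items():
    print(f"nested {name}: {fraction:.1%} per VM")
print(f"flat (6 VMs): {flat:.1%} per VM")
print(f"even nested: {even_nested['SharePoint Live']:.1%}, even flat: {even_flat:.1%}")
```

When every child RG holds the same number of VMs, the nested and flat layouts give each VM the same fraction; they only diverge when the child RGs are uneven.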

Of course, if you change the shares on the child Resource Groups or individual VMs, this will all change.

Thanks Helvick. I didn't know about the 80% rule being the point at which resource management starts. FYI - I'm not worried about CPU as that's running at around 20%-30% for the cluster. RAM is my constraint. I've not modified the child RGs. — Guy, Jan 13 '10 at 20:47
But... For the sake of argument, each sub group has 4 guests, each with 1 CPU and 1GB RAM. From the answer below, the SharePoint sub group will only get 14% of shares, but each guest will still receive the same NUMBER of shares regardless of whether the guest is in the top level group or a sub group. (I will add clarification to the question) — Guy, Jan 13 '10 at 21:05
The sysadmin says that we need to flatten the structure, but the maths still works the same. An individual SharePoint server (in the SP Live resource group) gets 4% (50% of 28% / 4) of total shares - 2048 shares in total. If all the servers were put into a top level group it would still only get 4% (28% / 8) and the same number of shares, 2048. — Guy, Jan 13 '10 at 21:24
...and if a host goes down then each guest still gets the same relative number of shares (4% of fewer shares), but the smaller resource groups get penalised more because Dev's 14% and Low's 1% of a busy system trying to recover is not very much, and the most important systems (the live SQL servers) get proportionally MORE shares. I think... — Guy, Jan 13 '10 at 21:27
I've tried to clarify this in an additional example added into the answer, it's not entirely that straightforward. It might work out the same but that depends on how many VMs are in other child RG's and what you do with them. — Helvick, Jan 13 '10 at 23:28

Chopper3 · Answer 2 · score 1 · answered Jan 13 '10 at 16:00

Resource definitions only ever take effect in an overcommitted cluster.

That's what I thought. So while the cluster is running smoothly, it makes no difference how I set up the shares, and it will not stop SharePoint (or any other server) from getting as much resource as it requires. – Guy Jan 13 '10 at 21:32