I'm a DBA managing a VMware ESX 3.5 cluster that predominantly hosts SQL Servers plus a few application servers, and I have a question about how to set up the resource groups. I'm in conflict with one of the ESX system admins about how to manage the resources.
The cluster (3 nodes, 32GB per node) currently hosts 33 guests configured to consume 77GB of RAM, although ESX reports that only 44GB is active. The cluster hosts live, test, and development servers, plus a few other miscellaneous guests.
What I'd like to do is simplify the management of the servers' resources, and be able to manage and report on the performance of related servers.
For example, the resources consumed (RAM, Disk, CPU) for the Live SQL servers, the SharePoint servers, the CRM servers etc.
What I have done is create 4 "top level" resource groups:
1-High - for the most mission-critical services (e.g. the live SQL server)
32768 memory shares
2-Normal - for the majority of the remaining live systems (CRM, SharePoint, etc.)
16384 memory shares
3-Dev - test and development systems
8192 memory shares
4-Low - unsupported servers (no SLA, temporary build servers, etc.)
1024 memory shares
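To sanity-check what those share values mean, here's a rough sketch of the fraction of cluster memory each top-level pool would be entitled to under full contention, assuming the four pools are siblings and shares are the only control in play (no reservations or limits):

```python
# Hypothetical sketch: each sibling pool's entitlement under full
# contention is its shares divided by the sum of all siblings' shares.
shares = {
    "1-High": 32768,
    "2-Normal": 16384,
    "3-Dev": 8192,
    "4-Low": 1024,
}
total = sum(shares.values())  # 58368

for pool, s in shares.items():
    print(f"{pool}: {s / total:.1%}")
# 1-High comes out around 56%, 2-Normal around 28%,
# 3-Dev around 14%, 4-Low under 2%.
```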
I have grouped the servers into their own "application" resource groups (SQL Live, SQL Test, CRM Live, CRM Test, etc.) but have not set any explicit resource limits on these groups.
I then placed each "application" group into the appropriate "top level" resource group.
For example, each sub-group has 4 guests, each with 1 vCPU and 1GB RAM:
1-High (32768 shares)
    SQL Live: 4 guests
2-Normal (16384 shares)
    CRM Live: 4 guests
    SharePoint Live: 4 guests
3-Dev (8192 shares)
    CRM Test: 4 guests
    SQL Test: 4 guests
    SharePoint Test: 4 guests
4-Low (1024 shares)
    Remaining cruft: 4 guests
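Extending the same share arithmetic one level down, and assuming the "application" pools are left at equal default shares (so siblings split their parent's entitlement evenly), the worst-case entitlement of each application group works out roughly like this:

```python
# Hypothetical sketch: an application pool's worst-case entitlement is
# its parent's fraction of the cluster divided among equal-share siblings.
top = {"1-High": 32768, "2-Normal": 16384, "3-Dev": 8192, "4-Low": 1024}
children = {
    "1-High": ["SQL Live"],
    "2-Normal": ["CRM Live", "SharePoint Live"],
    "3-Dev": ["CRM Test", "SQL Test", "SharePoint Test"],
    "4-Low": ["Remaining cruft"],
}
total = sum(top.values())

for pool, apps in children.items():
    pool_frac = top[pool] / total
    for app in apps:
        print(f"{app}: {pool_frac / len(apps):.1%} of cluster")
# e.g. SharePoint Live ends up with half of 2-Normal's roughly 28%,
# i.e. about 14% of the cluster, but only under full contention.
```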
The sysadmin chap is telling me that "SharePoint will only get 28% of 50% of the resources it needs!"
Before I reply to him, can I get some advice and a check on my assumptions:
- In normal operation the cluster is not overcommitting RAM (or CPU), so no resource limits are being applied to any guest, for either CPU or RAM.
- If one of the hosts fails, only 64GB of RAM will be available. With HA and DRS enabled, the remaining hosts will restart the failed guests, and this will overcommit the RAM.
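The memory arithmetic for that failure scenario, using the figures from the question ("active" being what ESX reports as actively used), looks like this:

```python
# Sketch of the host-failure memory arithmetic (figures from the question).
node_ram_gb, nodes = 32, 3
configured_gb, active_gb = 77, 44

healthy_gb = node_ram_gb * nodes            # 96 GB across the cluster
after_failure_gb = node_ram_gb * (nodes - 1)  # 64 GB on the survivors

print(f"Configured vs. remaining: {configured_gb}GB / {after_failure_gb}GB "
      f"-> overcommitted by {configured_gb - after_failure_gb}GB")
print(f"Active vs. remaining: {active_gb}GB / {after_failure_gb}GB")
# Configured RAM overcommits the surviving hosts by 13GB, although the
# actively used 44GB still fits within the remaining 64GB.
```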
- I want to ensure that the highest-priority services maintain their service levels.
- I don't want to micromanage each individual guest!
What are your thoughts and experiences?