1

We're facing intermittent periods of 100% CPU utilization.

Server Configuration:
HP DL580 G7 (4 processors with 8 cores each; 128GB memory)
Operating System: Solaris 10_x86 update 9
Application: Oracle 10 R2; ASM for Disk Management. DB size 5TB; SGA 78GB
Storage Subsystem: HP MSA2312sa Dual Controller SAS direct attached storage

On a normal day (CPU utilization around 20%), the vmstat output looks like this:
 kthr    memory            page                       disk         faults              cpu
 r  b  w      swap    free  re    mf  pi po fr de  sr s0 s1 s2  s3     in     sy    cs us sy id
 0 27 26 128133040 6469184 362  4937 829  3 22  0 117 -0  4  0  97  85888 383138 19238 19  2 79
 0 20 31 129089972 4009408 294  4341  28  0  0  0   0  0  2  0  96 144240 363898 27797 12  5 82
 1 17 31 128869152 3731692 243  4437   0  0  0  0   0  0  6  0  88 142738 385237 26503 10  5 84
 1 21 31 128803936 3665112 283  5545 111  0  0  0   0  0  3  0 102 157962 347356 26940 12  5 82
 2 20 31 128556548 3515596 274 10806   0  0  0  0   0  0  6  0  99 253881 391554 34754 13  7 80

Processes Summary:
Run Queue Processes- 0~2     Blocked Processes- 17~27     Swapped Processes- 26~31
CPU Utilization Summary:
User- 10%~19%     System- 2%~7%     Idle- 79%~84%
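
The ranges in a summary like this can be recomputed directly from the samples. Below is a minimal Python sketch that does so for the five vmstat rows quoted in the question. The function names and the renamed columns (`sys_calls`, `cpu_us`, `cpu_sy`, `cpu_id`) are my own choices for illustration, not anything vmstat itself provides; the renaming is only there because vmstat prints `sy` twice.

```python
# Column names in the order Solaris vmstat prints them. "sy" appears twice
# (system calls and system CPU time), so the duplicates are renamed here.
# These renamed labels are illustrative assumptions, not vmstat output.
COLUMNS = ("r b w swap free re mf pi po fr de sr s0 s1 s2 s3 "
           "in sys_calls cs cpu_us cpu_sy cpu_id").split()

# The five data rows exactly as posted in the question.
SAMPLES = """\
0 27 26 128133040 6469184 362 4937 829 3 22 0 117 -0 4 0 97 85888 383138 19238 19 2 79
0 20 31 129089972 4009408 294 4341 28 0 0 0 0 0 2 0 96 144240 363898 27797 12 5 82
1 17 31 128869152 3731692 243 4437 0 0 0 0 0 0 6 0 88 142738 385237 26503 10 5 84
1 21 31 128803936 3665112 283 5545 111 0 0 0 0 0 3 0 102 157962 347356 26940 12 5 82
2 20 31 128556548 3515596 274 10806 0 0 0 0 0 0 6 0 99 253881 391554 34754 13 7 80
"""


def parse_vmstat(text):
    """Turn vmstat data rows into dicts; header or malformed lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        try:
            rows.append({name: int(value) for name, value in zip(COLUMNS, fields)})
        except ValueError:
            # Non-numeric line (for example a repeated header): ignore it.
            continue
    return rows


def column_range(rows, name):
    """Return (min, max) of one column across all parsed rows."""
    values = [row[name] for row in rows]
    return min(values), max(values)


rows = parse_vmstat(SAMPLES)
for name in ("r", "b", "w", "cpu_us", "cpu_sy", "cpu_id"):
    low, high = column_range(rows, name)
    print(f"{name}: {low}~{high}")
```

Running the same script against a capture taken during one of the 100% bursts would make it easy to compare the two periods column by column.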

What can be the cause of such erratic CPU behaviour?
Why are the Blocked Processes (b) and Swapped Out Processes (w) much higher than the Running Processes (r)?
Are we looking at a CPU bottleneck, a memory bottleneck, or an I/O bottleneck?

We do run an Oracle RMAN backup, but the backup completes at 4AM every day.

The CPU utilization, however, shoots up to 100% during normal business hours (10AM to 6PM), and no background backups run during this period.

As for large queries, we do run fairly long and complex queries. These queries run every day and the CPU utilization barely crosses 40%, but for the past week we have been experiencing short bursts of 100% CPU utilization.

Jack

4 Answers

1

Do your VMs have the same number of processors as the host system? If so, this is a bad thing, and it can prevent the scheduler from working properly. I.e., if you have an 8-core system, then no system on that box should have 8 cores assigned to it. You can have 20 VMs with 4 cores assigned and that is not a problem, but one VM with 8 cores assigned can cause problems under load.

tkrabec
1

Are you experiencing 100% utilization across all 32 CPU cores or just a few? I can't really speak to the stats you have posted since they are fairly unreadable, but to try and give some general answers to the things you are experiencing:

Blocked/Swapped Out Processes: Sometimes processes on a server OS will bind to a specific CPU core and ONLY use that core for whatever it needs to do, ignoring all other cores. This is generally more of a problem for older pieces of software that weren't designed to run in multi-core systems. The end result is if you have a few processes doing this and they have decided to use the same core, they will constantly block and swap each other out to do what they need to do while you have other cores idle not doing anything. Sometimes you can configure the software to choose specific cores and manually "load-balance" the processes across your CPUs (similar to manual IRQ settings back in the day), but this is obviously undesirable since it requires a manual reconfiguration on your part and you may end up making things worse. Figure out which processes are blocking each other and focus on those. I doubt you have a CPU bottleneck with 32 cores, but I also can't tell for sure. Read the documentation on the processes/software to see what the vendor recommends and if you can even configure the process to do this.

Blocked/Swapped Out Processes higher than Running Processes: Likely what is happening is your performance counter is just ticking up every time a process gets blocked/swapped out and is not showing the CURRENT blocked/swapped processes, so this should always be higher than your running processes (which is just what it says: the number of currently running processes on your system). This should not be a concern.

August
  • r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id – Jack Aug 19 '11 at 19:59
  • Yes, we face 100% utilization across all cores. This condition lasts for 30 minutes to an hour. During this time even the ASM processes do not get CPU, and we've faced a database crash because the ASM process was not getting CPU. – Jack Aug 19 '11 at 20:17
1

At first sight, your system had a severe RAM shortage in the past. The average scan rate since last boot is 117, while it should be 0 or close to it on a system with enough RAM. This seems to be confirmed by the 31 in your w column, which likely means 31 daemons were swapped out during the RAM shortage event and never came back because they were unused.
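
To make that reading concrete, here is a small Python sketch. It assumes the vmstat output in the question was taken with an interval argument, so that the first row is the since-boot average and the remaining rows are live interval samples. The `(sr, w)` pairs are copied from the question; the function names and the zero threshold are illustrative assumptions.

```python
# (sr, w) pairs from the five vmstat rows in the question:
# sr = pages scanned per second, w = swapped-out threads.
SR_AND_W = [(117, 26), (0, 31), (0, 31), (0, 31), (0, 31)]


def split_boot_summary(samples):
    """Separate the first row (since-boot average) from the interval rows.

    Assumption: vmstat was run with an interval, so row 0 is the summary.
    """
    return samples[0], samples[1:]


def is_scanning_now(interval_samples, threshold=0):
    """True if any live sample shows the page scanner running above threshold."""
    return any(sr > threshold for sr, _w in interval_samples)


boot_summary, live = split_boot_summary(SR_AND_W)
print("since-boot average sr:", boot_summary[0])
print("page scanner active during sampling:", is_scanning_now(live))
```

Under that assumption, a non-zero scan rate only in the first row points to memory pressure at some earlier time rather than during the sampled period.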

jlliagre
0

Do you have any automated backup processes or anything else that would be thrashing the disk(s)? It sounds vaguely like you've got I/O wait issues. Can you get a snapshot of mpstat while your server's unhappy? You could probably rule out the disk I/O issue by doing small 5GB writes to disk in DIRECT_IO mode (to get around the fact that you could cache half the earth in free memory on that server). Also, have you tried (if you're able) examining your queries during this time? Maybe someone's slamming you with a bunch of full-index scans or something?

MrTuttle