
This issue has been bugging me for the past several days; I have spent over 40 hours investigating it intensively.

We run Asterisk 1.4.42, which I understand is old, but it is the last Asterisk version that works reliably with our upstream providers for fax (upgrading is not an option).

The server has the following spec:

  • Dell PowerEdge 1950
  • Quad-core Xeon E5420 @ 2.5 GHz
  • 8 GB ECC RAM
  • 4 x 73 GB 10k RPM SAS HDs
  • Dell PERC 5 RAID controller, RAID 10
  • CentOS 5.9 x64
  • ext3 filesystem

The problem is that we are seeing very high server load at 100 concurrent calls in Asterisk, and I cannot figure out why. I have another server of roughly similar spec (quad-core Core 2, RAID 1 with 2 x 250 GB 7,200 RPM HDs, 8 GB non-ECC RAM) that handles 200+ concurrent calls at a server load of about 0.3.

I am really at my wits' end with this and cannot figure it out.

I have attached screenshots of the top and iotop results.

The screenshots show low CPU usage, low memory usage and 0% disk I/O wait.

top - http://chostwales.com/images/hosted/Super-load.jpg

iotop - http://chostwales.com/images/hosted/HighDISKIO.jpg

Any help/ideas would be really really appreciated on this.

To clarify, this is 100 concurrent calls with approximately 1 new call every second. (As mentioned above, I have servers of much lower spec doing 10 new calls every second and the load hardly budges.)

To clarify:

  • No Call Recording/Monitoring
  • Transcoding is used on about 30% of the calls. (However, this would show up as CPU usage, from my understanding; see the CLI sketch after this list.)
  • We are NOT running any PRIs
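
For anyone wanting to sanity-check the transcoding share on a box like this, something along these lines should show it (a sketch; exact CLI command names and output columns vary slightly between Asterisk versions):

# list active SIP channels with the codec ("Format") negotiated on each leg
asterisk -rx "sip show channels"

# show Asterisk's codec translation cost table
asterisk -rx "core show translation"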

cat /proc/interrupts shows the following (no system utilisation at the time of capture):

[root@IS-21418 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       
  0:    7855099          0          0          0    IO-APIC-edge  timer
  1:          3          0          0          0    IO-APIC-edge  i8042
  8:          1          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 12:          4          0          0          0    IO-APIC-edge  i8042
 66:         24          0          0          0   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
 74:         34     106102          0          0   IO-APIC-level  uhci_hcd:usb3, uhci_hcd:usb5
 82:       4143      50727          0          0   IO-APIC-level  megasas
 90:     123985          0          0          0         PCI-MSI  eth0
NMI:        435        195        209        215 
LOC:    7852754    7851976    7852615    7851820 
ERR:          0
MIS:          0


[root@IS-21418 ~]# vmstat 1 20
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 7318888  23108 296540    0    0   125    61 1169 2581  2  3 93  1  0
 0  0      0 7318708  23124 296524    0    0     8   280 9704 20440  7  6 87  0  0
 0  0      0 7318820  23140 296768    0    0   128   280 9144 19752  2  5 93  0  0
 0  0      0 7318820  23180 296728    0    0     0  1620 8162 16012  2  2 97  0  0
 0  0      0 7318940  23208 296760    0    0    12   392 9729 22355  3  5 92  0  0
 0  0      0 7318544  23216 296752    0    0     0   100 9679 20152  2  2 96  0  0
 0  0      0 7317852  23232 296836    0    0     8   332 9753 21294  8  9 84  0  0
 0  0      0 7317720  23240 296828    0    0     4   160 9702 22166  3  3 95  0  0
 0  0      0 7317612  23248 296908    0    0     0   192 9643 20168  1  4 95  0  0
 0  0      0 7317340  23256 296900    0    0     0   112 9043 19541  2  2 96  0  0
 0  0      0 7315860  23264 296944    0    0     4   156 9025 21814  3  4 92  0  0
 0  0      0 7315624  23288 297176    0    0   140   504 9221 19047  6  6 87  1  0
 0  0      0 7314872  23296 297140    0    0     4   112 9499 21123  3  8 89  0  0
 3  0      0 7314492  23344 297092    0    0     4  1784 9725 24151  5  6 88  0  0
 1  0      0 7314796  23352 297192    0    0     0   176 9624 22662  4  7 89  0  0
 3  0      0 7314556  23368 297176    0    0     4   220 9789 23502  5  6 88  0  0
 2  0      0 7313820  23384 297196    0    0     4   348 9531 23117 14 13 74  0  0
 1  0      0 7313468  23432 297148    0    0    12   504 9852 25504  6 11 83  0  0
 2  0      0 7313104  23440 297268    0    0     4   112 9610 26564  6  7 88  0  0
 0  0      0 7312364  23464 297244    0    0   128   356 9608 23673  5  8 87  0  0

A link to the dmesg output is posted in the comments below.

Kind Regards

TheMightY
    Ah - ok - where is the high end server here? – TomTom Jul 17 '13 at 19:59
  • What is the actual problem? You report high load, which could be a useful clue to figure out what's causing your problem, but what is the actual problem? Is performance poor? – David Schwartz Jul 17 '13 at 20:05
  • This is the problem: we don't know what's causing it. Obviously it gets much, much worse when we start calling. When idle, the server load can still be around 0.09 - 0.2. – TheMightY Jul 17 '13 at 20:08
  • @tomtom: I'll have you know the 1950 was the height of Dell's 9th generation product line! It was so good that after that, they changed the naming convention! Just because they're on the 13th generation now... – Satanicpuppy Jul 17 '13 at 20:19
  • Maybe a hardware issue!? Which protocol is used (SIP, IAX, ISDN, BRI, other)? How are your interrupts distributed (`cat /proc/interrupts`)? What's in your logs: `kern.log` and maybe asterisk.log... – F. Hauri - Give Up GitHub Jul 17 '13 at 20:41
  • We are using SIP. Possibly a hardware issue. I have posted the idle interrupts above and will post the kern.log as/when I can. – TheMightY Jul 17 '13 at 21:02
  • Kernel messages (under load) - http://chostwales.com/images/hosted/dmesg.txt (too many lines to post here) – TheMightY Jul 17 '13 at 21:30
  • You are not showing threads; there may be more tasks to see in top if you do this. Also provide the output of `grep Cpus_allowed_list /proc/<pid>/status` where `<pid>` is the pid of the asterisk process. – Matthew Ife Jul 17 '13 at 21:55
  • @TheMightY: What is the "this" that is the problem? Is performance poor? – David Schwartz Jul 17 '13 at 23:58
  • @MIfe I am trying to get the `grep Cpus_allowed_list /proc/<pid>/status` results, where I'm changing `<pid>` to the process PID number, but it's not showing anything. Just returning a new line in SSH. I have also tried other processes as well, such as MySQL etc. – TheMightY Jul 18 '13 at 07:39

2 Answers


Things like this vary a lot. For instance, are you recording calls? If so, are you using Monitor or MixMonitor? Monitor is processed in the same thread as the call; MixMonitor runs in its own thread. And if you are recording, you probably have a solid disk hit. I solve some of this by turning off atime in /etc/fstab.
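
For example, turning off atime is just a mount-option change (a sketch; the device, filesystem and mount point below are placeholders, so match them to your own fstab):

# /etc/fstab - add noatime (and optionally nodiratime) to the busy partition
# <device>     <mount point>   <fs>    <options>                        <dump> <pass>
/dev/sda3      /var            ext3    defaults,noatime,nodiratime      1      2

You can apply it without a reboot with mount -o remount,noatime /var.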

Something you can do to get an idea of what is going on in your system is to run vmstat. A simple vmstat 1 20 will give you output to look at so you can see what is eating the CPU.

Another thing you can do with Asterisk is remove modules you don't need by adding "noload =>" lines to modules.conf. Often there are a lot of them. You'll just have to take some time to learn which modules you do and don't use, as they are all autoloaded during startup.
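
For example, a modules.conf along these lines (the noload entries here are only common examples; check which modules are actually in use on your own box before disabling anything):

; /etc/asterisk/modules.conf
[modules]
autoload=yes
; channel drivers and resources this box does not use (examples only)
noload => chan_skinny.so
noload => chan_mgcp.so
noload => pbx_dundi.so
noload => res_snmp.so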

One more thing to consider is transcoding. If you're accepting calls using the G.729A codec and your softphones/deskphones use G.711u, you're going to take a performance hit, as Asterisk has to transcode between those codecs and can't just perform packet-to-packet bridging.
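
If you control both legs, one way to avoid that hit is to pin both sides to the same codec in sip.conf so Asterisk can bridge the media natively. Roughly like this (a sketch; the peer name is made up, and whether this is possible depends on what your provider and phones will accept):

; /etc/asterisk/sip.conf - force one common codec on both legs
[general]
disallow=all
allow=ulaw          ; G.711u end to end, nothing to transcode

[provider-trunk]    ; example peer name - per-peer sections override the same way
disallow=all
allow=ulaw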

Matt W
  • To clarify: no call recording/monitoring; transcoding is used on about 30% of the calls (however, this would show as CPU usage, from my understanding). – TheMightY Jul 17 '13 at 20:30
  • Just added vmstat under load (above) – TheMightY Jul 17 '13 at 21:24
  • Interrupts look kind of high. Are you running any PRIs in this system? I've run into issues before where I've started turning off serial and USB ports in the BIOS to cut interrupts down. PRIs cause this to go a lot higher. – Matt W Jul 17 '13 at 21:41
  • I have a very similar 1950 setup as yours and I can reach 200 calls easy without issue. The difference is I'm running 1.8. This leads me to believe that it may be Asterisk 1.4. While I understand you need it for faxing purposes, maybe it's time to split those operations onto different servers? That or see if there is some port available for the new code. – Matt W Jul 18 '13 at 16:51

I found Munin helpful for identifying bottlenecks. You can easily spot limits when one graph does not scale like the others.
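
On CentOS 5 it is available from EPEL; a minimal setup looks roughly like this (package names assume the EPEL repository is already enabled):

# on the Asterisk box: install and start the agent
yum install munin-node
munin-node-configure --suggest     # lists which plugins apply to this machine
service munin-node start
chkconfig munin-node on

# on the box that draws the graphs (can be the same machine)
yum install munin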

Stefan
  • Thanks for this. I should have mentioned we use eLuna, and its graphs back up what we're seeing in the SSH sessions. – TheMightY Jul 17 '13 at 20:08