CentOS 6.9 (64 GB RAM)
Running nginx, mariadb, php-fpm, iptables, java
The server has random but frequent bursts of 100% system CPU load on a single core, which cripple network connections to the server.
The problem persists even with nginx, mariadb, php-fpm, iptables and java all stopped.
What I have tried so far, none of which changed anything: installing irqbalance, rebooting several times, running yum update, and moving the SSD to another server with the same hardware. A SMART check of the SSD reports no errors. I also checked whether the problem was related to swappiness, but nothing is being swapped.
"/proc/interrupts" shows that the interrupt associated with the ksoftirqd load is eth0, but I don't know which troubleshooting steps to take to find the cause. I need help because the services hosted on this server suffer badly from the downtime during the bursts (which can last 10-15 minutes, stop, and then reappear at random).
Neither top nor htop shows anything suspicious running or using that much CPU, only ksoftirqd and events.
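Since ksoftirqd only says that some softirq is busy, one way to narrow it down is to compare two snapshots of /proc/softirqs and see which softirq type (for example NET_RX) grows fastest and on which core. The following is a minimal sketch of that comparison, not a finished tool. It assumes Python 3 is available (CentOS 6 ships Python 2.6 by default), and the two-CPU snapshots embedded in it are synthetic values invented only to exercise the parser.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs text into {softirq_name: [count_per_cpu, ...]}."""
    rows = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep:  # the CPU header line has no colon and is skipped
            rows[name.strip()] = [int(x) for x in rest.split()]
    return rows


def softirq_deltas(before, after):
    """Return [(name, cpu_index, increase)], largest increase first."""
    old, new = parse_softirqs(before), parse_softirqs(after)
    deltas = []
    for name, counts in new.items():
        prev = old.get(name, [0] * len(counts))
        for cpu, (a, b) in enumerate(zip(prev, counts)):
            deltas.append((name, cpu, b - a))
    return sorted(deltas, key=lambda d: d[2], reverse=True)


# Synthetic snapshots (NOT from the affected server), for illustration only.
BEFORE = """\
          CPU0       CPU1
    HI:      0          0
 TIMER:   1000        900
NET_TX:     10          5
NET_RX:   5000        100
"""
AFTER = """\
          CPU0       CPU1
    HI:      0          0
 TIMER:   1200       1100
NET_TX:     12          6
NET_RX:  95000        120
"""

for name, cpu, inc in softirq_deltas(BEFORE, AFTER)[:3]:
    print("%-8s CPU%-2d +%d" % (name, cpu, inc))
```

On the live server the two strings would instead come from reading /proc/softirqs twice, a few seconds apart, while a burst is in progress.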
The problem started just a few days ago. As far as I am aware, no changes were made to the kernel or OS that could have caused it.
"iostat" during the 100% load
Linux 2.6.32-696.30.1.el6.x86_64 (CentOS-69-64-minimal)  _x86_64_  (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.01    0.00    3.03    0.20    0.00   88.76

Device:       tps   Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sdb         83.52        18.46      1341.05    2874477   208769462
sda         94.26       435.50      1341.05   67797010   208769462
md1          0.00         0.01         0.00       2106          12
md0          0.26         0.25         1.82      38640      283096
md2        176.32       453.67      1322.56   70625762   205890864
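If the iostat report above was taken without an interval argument, its figures are averages since boot and averaged over all 16 cores, so they cannot show what one core is doing during a burst. A per-core breakdown over a short window can be derived from two reads of /proc/stat. The sketch below assumes Python 3 and uses synthetic two-core snapshots (invented values, not measurements from this server); the field order follows the documented /proc/stat layout of user, nice, system, idle, iowait, irq, softirq, steal.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_stat(text):
    """Parse /proc/stat text into {"cpu0": {field: ticks}, ...} (per-core lines only)."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep "cpu0", "cpu1", ... but skip the aggregate "cpu" line.
        if parts and parts[0].startswith("cpu") and parts[0] != "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            cores[parts[0]] = dict(zip(FIELDS, values))
    return cores


def percent_breakdown(before, after):
    """Return {core: {field: percent of ticks elapsed between the snapshots}}."""
    old, new = parse_stat(before), parse_stat(after)
    result = {}
    for core, now in new.items():
        diff = {f: now[f] - old[core][f] for f in FIELDS}
        total = sum(diff.values()) or 1  # avoid dividing by zero
        result[core] = {f: 100.0 * diff[f] / total for f in FIELDS}
    return result


# Synthetic snapshots (NOT from the affected server), for illustration only.
BEFORE = """\
cpu  200 0 100 2000 10 0 20 0 0 0
cpu0 100 0 50 1000 5 0 10 0 0 0
cpu1 100 0 50 1000 5 0 10 0 0 0
"""
AFTER = """\
cpu  300 0 100 2900 10 0 1020 0 0 0
cpu0 100 0 50 1000 5 0 1010 0 0 0
cpu1 200 0 50 1900 5 0 10 0 0 0
"""

for core, pct in sorted(percent_breakdown(BEFORE, AFTER).items()):
    print("%s system=%.1f%% softirq=%.1f%% idle=%.1f%%"
          % (core, pct["system"], pct["softirq"], pct["idle"]))
```

Separating the "system" and "softirq" columns this way would show whether the saturated core is spending its time in system calls or in softirq processing.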
"/proc/interrupts" during the 100% load
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
0: 681 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge timer
1: 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge i8042
8: 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge rtc0
9: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi
12: 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC-edge i8042
56: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge aerdrv
57: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge aerdrv
58: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge aerdrv
65: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
66: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
67: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge xhci_hcd
68: 16149263 0 0 0 0 0 0 0 0 0 0 19021454 0 0 0 0 PCI-MSI-edge ahci
69: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI-edge ahci
70: 158827141 0 0 0 82558205 0 0 0 0 0 2755343 0 0 0 0 0 PCI-MSI-edge eth0
NMI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Non-maskable interrupts
LOC: 123773684 105894389 123476055 142376826 111487788 122494116 118841739 134480148 113422196 121203288 114414525 114218214 114794017 119322938 115083581 119549111 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Performance monitoring interrupts
IWI: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 IRQ work interrupts
RES: 54086898 67527262 46597734 44323475 25356657 32869325 18540932 20137227 13606660 13955101 14826738 12242106 10962617 11082631 10466998 10574150 Rescheduling interrupts
CAL: 1258 1407 1440 1446 1474 1442 1448 1436 1436 1435 1435 1431 1438 1449 1449 1430 Function call interrupts
TLB: 8082115 6419817 4992332 3914962 5927373 4081295 4056598 2953591 4134873 3207107 3852793 5106863 3780341 3298234 3875200 3270066 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 520 520 520 520 520 520 520 520 520 520 520 520 520 520 520 520 Machine check polls
ERR: 0
MIS: 0
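The counters in /proc/interrupts are cumulative since boot, so a single snapshot shows which cores have ever serviced eth0, not where the interrupts land during a burst. As a rough aid for reading the table, this sketch (assuming Python 3) sums the per-CPU counts for rows whose last column matches a device name. The embedded sample reuses the ahci and eth0 rows from the output above; the function name and the exact-match rule on the last column are my own choices, and a multi-queue NIC with names such as eth0-TxRx-0 would need a looser match.

```python
def per_cpu_counts(text, device):
    """Return {cpu_name: count} summed over /proc/interrupts rows ending in `device`."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()                 # header: CPU0 CPU1 ...
    totals = dict((cpu, 0) for cpu in cpus)
    for ln in lines[1:]:
        _, _, rest = ln.partition(":")
        fields = rest.split()
        if not fields or fields[-1] != device:
            continue
        # zip stops after the last CPU column, ignoring the trailing labels.
        for cpu, value in zip(cpus, fields):
            totals[cpu] += int(value)
    return totals


HEADER = " ".join("CPU%d" % i for i in range(16))
SAMPLE = HEADER + """
68: 16149263 0 0 0 0 0 0 0 0 0 0 19021454 0 0 0 0 PCI-MSI-edge ahci
70: 158827141 0 0 0 82558205 0 0 0 0 0 2755343 0 0 0 0 0 PCI-MSI-edge eth0
"""

counts = per_cpu_counts(SAMPLE, "eth0")
total = sum(counts.values())
for cpu, n in counts.items():
    if n:
        print("%-6s %12d  %5.1f%%" % (cpu, n, 100.0 * n / total))
```

Running the same function on two live reads of /proc/interrupts and subtracting the results would show which core receives the eth0 interrupts while the load is at 100%.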
The only unusual thing I have seen in dmesg is this line, repeated 50 times since boot (I replaced the IP with X for privacy reasons):
TCP: Peer X.XX.XXX.XXX:56847/44567 unexpectedly shrunk window 2670303830:2670305282 (repaired)
htop
https://i.stack.imgur.com/lpP1d.png
Any kind of help is appreciated, I'm really desperate to solve this right now.