
On an AWS x1.32xlarge instance (128 cores), we are seeing a large number of interrupts per second.

Here are the top CPUs in interrupts/s:

Interrupts Top CPUs
CPU0: 140838.0
CPU1: 77867.0
CPU4: 66495.0
CPU6: 59941.0
CPU3: 39096.0
CPU2: 31532.0
CPU7: 30861.0
CPU5: 26042.0
CPU8: 4168.0
CPU12: 3026.0
CPU10: 2793.0

Here are the top interrupts/s/CPU:

Interrupts above 10k/s
HYP [Hypervisor callback interrupts] [CPU0] = 46902.0/sec
49 [xen-percpu-ipi resched0] [CPU0] = 43437.0/sec
RES [Rescheduling interrupts] [CPU0] = 41512.0/sec
HYP [Hypervisor callback interrupts] [CPU2] = 26638.0/sec
HYP [Hypervisor callback interrupts] [CPU8] = 22875.0/sec
HYP [Hypervisor callback interrupts] [CPU12] = 20813.0/sec
55 [xen-percpu-ipi resched1] [CPU2] = 20749.0/sec
RES [Rescheduling interrupts] [CPU2] = 19568.0/sec
73 [xen-percpu-ipi resched4] [CPU8] = 16400.0/sec
RES [Rescheduling interrupts] [CPU8] = 15677.0/sec
HYP [Hypervisor callback interrupts] [CPU6] = 14226.0/sec
85 [xen-percpu-ipi resched6] [CPU12] = 14060.0/sec
RES [Rescheduling interrupts] [CPU12] = 13271.0/sec
HYP [Hypervisor callback interrupts] [CPU14] = 12173.0/sec
HYP [Hypervisor callback interrupts] [CPU4] = 11887.0/sec
HYP [Hypervisor callback interrupts] [CPU10] = 10500.0/sec
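
For reference, per-CPU rates like the ones above can be derived by sampling `/proc/interrupts` twice and dividing the counter deltas by the interval. The following is only a minimal sketch of that idea, not the tool that produced the figures above; the function names are made up for illustration, and rows without one counter per CPU (such as `ERR`) are simply skipped.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_name: [count_cpu0, count_cpu1, ...]}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        values = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break
            values.append(int(field))
        # Keep only rows that carry one counter per CPU (skips ERR, MIS, ...).
        if len(values) == ncpus:
            counts[name.strip()] = values
    return counts


def per_cpu_rates(before, after, interval):
    """Return total interrupts/sec for each CPU between two parsed snapshots."""
    if not after:
        return []
    ncpus = len(next(iter(after.values())))
    totals = [0.0] * ncpus
    for name, now in after.items():
        prev = before.get(name, [0] * ncpus)
        for cpu in range(ncpus):
            totals[cpu] += (now[cpu] - prev[cpu]) / interval
    return totals


def sample_rates(path="/proc/interrupts", interval=1.0):
    """Hypothetical helper: take two snapshots `interval` seconds apart."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return per_cpu_rates(before, after, interval)
```

Calling `sample_rates()` on the instance returns a list indexed by CPU number, which can then be sorted to find the busiest cores.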

This happens when the application running on that machine is under significant load. The network traffic is relatively high, and there are lots of threads.

My question is: are 50K to 150K interrupts/sec too many? How should we interpret that number? Is there a maximum number of interrupts/sec?

UPDATE:

Here is a glimpse of the top output:

Tasks: 825 total,   3 running, 822 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.6%us,  3.4%sy,  0.0%ni, 83.6%id,  0.0%wa,  0.0%hi,  2.3%si,  0.0%st
Mem:  2014742856k total, 40059184k used, 1974683672k free,   162036k buffers
Swap:        0k total,        0k used,        0k free,  3159112k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                              
 32936 ec2-user  20   0 77.3g  11g  29m S 1759.7  0.6   1780:36 java                                                                                                                                               
 32118 ec2-user  20   0 64.2g  10g  26m S 1036.9  0.6  62:31.08 java                                                                                                                                               
     3 root      20   0     0    0    0 R 70.4  0.0  14:54.84 ksoftirqd/0                                                                                                                                          
    12 root      20   0     0    0    0 S 21.2  0.0   6:06.47 ksoftirqd/1                                                                                                                                          
    16 root      20   0     0    0    0 S 15.2  0.0   4:33.28 ksoftirqd/2                                                                                                                                          
    20 root      20   0     0    0    0 S 12.2  0.0   3:34.12 ksoftirqd/3                                                                                                                                          
    28 root      20   0     0    0    0 S 11.9  0.0   3:24.96 ksoftirqd/5                                                                                                                                          
    24 root      20   0     0    0    0 S 11.6  0.0   3:26.54 ksoftirqd/4                                                                                                                                          
    32 root      20   0     0    0    0 S 10.2  0.0   3:23.56 ksoftirqd/6                                                                                                                                          
    36 root      20   0     0    0    0 S 10.2  0.0   3:28.80 ksoftirqd/7  

UPDATE2: htop output

benji

2 Answers


Most of the interrupts came from the network card queues. Following this guide allowed us to spread the load onto other cores: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-cpu-irq.html
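
The linked guide works by writing a hexadecimal CPU bitmask to `/proc/irq/<irq>/smp_affinity`. As a rough sketch of how such a mask can be computed (the helper name `affinity_mask` is made up for illustration, and the IRQ number in the usage note is a placeholder, not one taken from this machine):

```python
def affinity_mask(cpus):
    """Build an smp_affinity-style hex mask for the given CPU indices.

    Bit N of the mask selects CPU N. On machines with more than 32 CPUs
    the kernel prints the mask as comma-separated 32-bit groups, most
    significant group first, so the same layout is produced here.
    """
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    digits = f"{mask:x}"
    # Pad to a whole number of 8-hex-digit (32-bit) groups.
    width = -(-len(digits) // 8) * 8
    digits = digits.zfill(width)
    return ",".join(digits[i:i + 8] for i in range(0, width, 8))
```

For example, `affinity_mask(range(8, 16))` gives `0000ff00`, which could then be written as root to `/proc/irq/<irq>/smp_affinity` for a given network queue IRQ. Two caveats, stated as assumptions rather than facts about this system: a running `irqbalance` daemon may overwrite manual settings, and per-CPU interrupts such as the Xen IPIs listed in the question cannot be moved this way.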

benji

Without knowing what your application does and the load it generates, it is not possible to tell whether your system has "too many interrupts".

You can use top to inspect the system (`sy`) CPU time value. If it is high, it means that a significant portion of the CPU load happens in kernel context. This can, in turn, be a sign of an interrupt storm.
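
If the terminal is too small for the per-CPU view in top, the same numbers can be read from `/proc/stat`. The sketch below is only an illustration (the function names are invented here): it computes the share of time each CPU spent in softirq context between two snapshots, using the documented field order `user nice system idle iowait irq softirq steal guest guest_nice`.

```python
def parse_cpu_times(text):
    """Parse /proc/stat text into {"cpu0": [user, nice, system, ...], ...}."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep the per-CPU rows (cpu0, cpu1, ...) and skip the aggregate "cpu" row.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            times[fields[0]] = [int(v) for v in fields[1:]]
    return times


def softirq_share(before, after):
    """Return {cpu: percent of elapsed time spent in softirq} between snapshots."""
    shares = {}
    for cpu, now in after.items():
        delta = [n - p for n, p in zip(now, before[cpu])]
        # Sum user..steal only; guest time is already counted inside user.
        total = sum(delta[:8])
        shares[cpu] = 100.0 * delta[6] / total if total else 0.0
    return shares
```

Reading `/proc/stat` twice, a second or so apart, and passing both parsed snapshots to `softirq_share` shows whether softirq work is concentrated on a few cores.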

shodanshok
  • Added top output - ksoftirqd seems to be very high? – benji Jul 27 '17 at 20:38
  • Please expand your `top` output by pressing `1` on your keyboard, and post the detailed result. – shodanshok Jul 27 '17 at 20:43
  • It won't, terminal is too small. But I added the output of htop which I think should be helpful. – benji Jul 27 '17 at 20:59
  • Mmm no, I need to see the per-CPU load. Anyway, with such a high load for `ksoftirqd`, you are probably under an IRQ storm. So yes, you have too many interrupts. To identify the root cause, you should do an in-depth debug of your system. Start by throttling network traffic, then proceed with the other devices. – shodanshok Jul 27 '17 at 21:09
  • Is there a particular reason why only cores 0-8 are under heavy system load? Is it possible to have the other cores help with those IRQs? – benji Jul 27 '17 at 21:19