
I have a Docker Swarm overlay network connecting 6 nodes, each running 4 containers that communicate very frequently. While trying to identify the network bottleneck, I found that the culprit is the ksoftirqd process (handling Docker Swarm's networking traffic), which uses up all the CPU on the manager node and causes my app to crash. So my question is: has anyone found a workaround for this? I am trying to avoid migrating to Kubernetes.

[screenshot: ksoftirqd process]
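For anyone hitting the same symptom: a quick way to confirm that softirq handling is the bottleneck, and which CPU it lands on, is to watch the per-CPU counters in /proc/softirqs and the per-CPU utilisation while traffic is flowing. A minimal sketch (assuming the sysstat package is installed for mpstat):

    # Per-CPU softirq counters; NET_RX/NET_TX rising in a single column
    # means one CPU is absorbing all of the network softirq work.
    watch -n1 cat /proc/softirqs

    # Per-CPU utilisation; a CPU pinned near 100% in the %soft column confirms it.
    mpstat -P ALL 1

    # CPU time consumed by the per-CPU ksoftirqd threads.
    ps -eLo pid,psr,pcpu,comm | grep ksoftirqd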

The hardware info:
System:    Host: caliper-latest Kernel: 4.15.0-99-generic x86_64 bits: 64 gcc: 7.5.0 Console: tty 1
           Distro: Ubuntu 18.04.4 LTS
Machine:   Device: kvm System: QEMU product: Standard PC (i440FX + PIIX 1996) v: pc-i440fx-2.8 serial: N/A
           Mobo: N/A model: N/A serial: N/A BIOS: SeaBIOS v: 1.10.2-1 date: 04/01/2014
CPU(s):    15 Single core QEMU Virtual version 2.5+s (-SMP-) arch: P6 II rev.3 cache: 245760 KB
           flags: (lm nx sse sse2 sse3) bmips: 79799
           clock speeds: max: 2659 MHz 1: 2659 MHz 2: 2659 MHz 3: 2659 MHz 4: 2659 MHz 5: 2659 MHz 6: 2659 MHz
           7: 2659 MHz 8: 2659 MHz 9: 2659 MHz 10: 2659 MHz 11: 2659 MHz 12: 2659 MHz 13: 2659 MHz 14: 2659 MHz
           15: 2659 MHz
Graphics:  Card: Cirrus Logic GD 5446 bus-ID: 00:02.0
           Display Server: N/A driver: N/A tty size: 270x20 Advanced Data: N/A out of X
Network:   Card: Realtek RTL-8100/8101L/8139 PCI Fast Ethernet Adapter
           driver: 8139cp v: 1.3 port: c000 bus-ID: 00:03.0
           IF: ens3 state: up speed: 100 Mbps duplex: full mac: <filter>
Drives:    HDD Total Size: 32.2GB (27.3% used)
           ID-1: /dev/vda model: N/A size: 32.2GB
Partition: ID-1: / size: 29G used: 8.2G (29%) fs: ext4 dev: /dev/vda1
RAID:      No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors:   None detected - is lm-sensors installed and configured?
Info:      Processes: 238 Uptime: 3:46 Memory: 774.3/14022.2MB Init: systemd runlevel: 5 Gcc sys: 7.5.0
           Client: Shell (bash 4.4.201) inxi: 2.3.56

  • Do you have that behaviour in every swarm node? – João Alves Apr 30 '20 at 08:42
  • No, only the manager. – Nima Afraz Apr 30 '20 at 09:57
  • What is the base OS on which the docker swarm is deployed? And the hardware? – João Alves Apr 30 '20 at 12:19
  • @JoãoAlves I added the HW info to the question. – Nima Afraz Apr 30 '20 at 13:42
  • Please take a look at `/proc/softirqs` to see what is going on. – João Alves Apr 30 '20 at 15:08
  • @JoãoAlves Thanks, I checked softirqs and realized only one vCPU is being used, and since the virtualization platform (OpenNebula) allows vCPU consolidation, I exposed the 15 CPUs I allocated to the VM as a single CPU. Now I don't have the ksoftirqd problem. However, this did not solve my overall performance problem with the Docker Swarm overlay network. – Nima Afraz Apr 30 '20 at 18:32
  • The problem of the system crashing was not resolved. I posted a new question with more details. https://serverfault.com/questions/1015272/unbalanced-interrupt-handling-in-multi-cpu-leads-to-a-crash-uhci-hcdusb1-ens3 – Nima Afraz May 04 '20 at 16:29
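For reference, since the comments above conclude that all network softirq processing was landing on a single vCPU: on a single-queue NIC such as the emulated RTL-8139 (8139cp driver) shown in the hardware info, one commonly suggested workaround is Receive Packet Steering (RPS), which spreads softirq processing of received packets across several CPUs. This is not the fix the author applied; it is only a sketch of one possible mitigation, assuming the interface name ens3 from the inxi output and 15 online CPUs (hex mask 7fff); run as root and adjust the mask to your topology:

    # Allow RX packet processing for queue 0 of ens3 on CPUs 0-14
    # (hex bitmask, one bit per CPU).
    echo 7fff > /sys/class/net/ens3/queues/rx-0/rps_cpus

    # Optionally enable RFS so flows tend to stay on the CPU running the consumer.
    echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
    echo 32768 > /sys/class/net/ens3/queues/rx-0/rps_flow_cnt

    # Verify where softirqs are now being handled.
    watch -n1 cat /proc/softirqs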

0 Answers