One of our production clusters running XCP suddenly became unresponsive. After a restart and some investigation, we found the following entries in the dom0 syslog:

Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659040] irq 339: nobody cared (try booting with the "irqpoll" option)
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659058] Pid: 0, comm: swapper/3 Tainted: G         C O 3.2.0-24-generic #37-Ubuntu
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659060] Call Trace:
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659062]  <IRQ>  [<ffffffff810db37d>] __report_bad_irq+0x3d/0xe0
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659071]  [<ffffffff810db605>] note_interrupt+0x135/0x190
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659074]  [<ffffffff810d8e69>] handle_irq_event_percpu+0xa9/0x220
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659078]  [<ffffffff8130ff3b>] ? radix_tree_lookup+0xb/0x10
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659081]  [<ffffffff810d9031>] handle_irq_event+0x51/0x80
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659084]  [<ffffffff810dc187>] handle_edge_irq+0x87/0x140
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659089]  [<ffffffff813a8829>] __xen_evtchn_do_upcall+0x199/0x250
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659092]  [<ffffffff813aa96f>] xen_evtchn_do_upcall+0x2f/0x50
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659096]  [<ffffffff81666d3e>] xen_do_hypervisor_callback+0x1e/0x30
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659097]  <EOI>  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659104]  [<ffffffff810013aa>] ? hypercall_page+0x3aa/0x1000
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659107]  [<ffffffff8100a1d0>] ? xen_safe_halt+0x10/0x20
Oct 26 20:32:03 hetzner-2-mrx kernel: [1797931.659110]  [<fff

IRQ 339 as shown by cat /proc/interrupts:

339:  ...  xen-pirq-msi-x     eth0

where eth0 is the hardware NIC.
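To check whether the counter for that IRQ is still advancing, the row can be pulled out of /proc/interrupts and the per-CPU counts summed. The following is only a minimal sketch: the helper name `irq_total` and the counts in `SAMPLE` are made up for illustration (the real counts are elided above), and the second sample row is hypothetical.

```python
def irq_total(proc_interrupts_text, irq):
    """Return (total_count, description) for one IRQ row, or None if absent."""
    for line in proc_interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue  # header line or a different IRQ
        fields = rest.split()
        counts = []
        for field in fields:
            if not field.isdigit():
                break  # per-CPU counters end where the description starts
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        return sum(counts), description
    return None


# Hypothetical sample: the counts and the second row are invented.
SAMPLE = """\
           CPU0       CPU1
339:       1200         34   xen-pirq-msi-x     eth0
340:          5          0   xen-dyn-event      hypothetical-dev
"""

total, description = irq_total(SAMPLE, 339)
print(total, description)
```

On a live host one would read the text with `open("/proc/interrupts").read()` twice, a few seconds apart, and compare the two totals.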

While the host machine appeared to hang, the guest machines kept running, so our small internal monitor on one of the virtual hosts logged the following:

[2012-10-26 20:31:51] [OK......] 200 OK : 113159149 ns
[2012-10-26 20:32:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 47763284432 ns
...
[2012-10-26 20:34:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 46894835070 ns
[2012-10-26 20:34:57] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 16821741955 ns
...
[2012-10-26 20:38:18] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 20103298289 ns
[2012-10-26 20:38:37] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 17895754943 ns
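To line the monitor output up with the syslog entry, the log can be parsed into timestamps and durations in seconds. This is a rough sketch that assumes every line follows the format shown above; the regular expression and the `parse_line` helper are my own and not part of the monitoring tool.

```python
import re
from datetime import datetime

# Assumed line format: [timestamp] [STATUS...] message : <duration> ns
LINE_RE = re.compile(
    r"\[(?P<ts>[\d\- :]+)\] \[(?P<status>[A-Z]+)\.*\] (?P<msg>.*) : (?P<ns>\d+) ns$"
)


def parse_line(line):
    """Parse one monitor line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "ts": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "status": match["status"],
        "msg": match["msg"],
        "seconds": int(match["ns"]) / 1e9,  # nanoseconds to seconds
    }


LOG = """\
[2012-10-26 20:31:51] [OK......] 200 OK : 113159149 ns
[2012-10-26 20:32:40] [DISASTER] 500 Can't connect to [hostname]:80 (No route to host) : 47763284432 ns
...
[2012-10-26 20:34:57] [DISASTER] 500 Can't connect to [hostname]:80 (Bad hostname) : 16821741955 ns
"""

# Lines such as "..." do not match the pattern and are skipped.
entries = [e for e in map(parse_line, LOG.splitlines()) if e is not None]
last_ok = max(e["ts"] for e in entries if e["status"] == "OK")
first_fail = min(e["ts"] for e in entries if e["status"] != "OK")
print("last OK:", last_ok, "first failure:", first_fail)
for e in entries:
    print(e["ts"], e["status"], round(e["seconds"], 2), "s")
```

With the lines above, the last successful check is at 20:31:51 and the first failure at 20:32:40, so the kernel message at 20:32:03 falls between them.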

Host and guest OS: Ubuntu 12.04 LTS. lspci output for the NIC:

05:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
Subsystem: ASUSTeK Computer Inc. Device 8369
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 17
Region 0: Memory at fe500000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at e000 [size=32]
Region 3: Memory at fe520000 (32-bit, non-prefetchable) [size=16K]
Capabilities: <access denied>
Kernel driver in use: e1000e
Kernel modules: e1000e

Any hints on how to debug this?

Vlad Fedin
    Did you "try booting with the "irqpoll" option" like the error suggests? – mdpc Oct 26 '12 at 22:35
    No, because this is the first time it has happened, and here is what I read about this option: "The irqpoll option is used for machines with broken interrupt routing; when an interrupt arrives, the kernel tries the handler for all other interrupt lines, too, and it does the same checks in every timer interrupt (in case that some interrupt did not arrive at all). This lowers performance, and is useful only if some device would _not_ work without it. Your e100 works, so don't bother." – Vlad Fedin Oct 27 '12 at 09:23
