Linux stuck in CPU soft lockup?

Question

My system runs CentOS 6.3 (kernel version 2.6.32-279.el6.x86_64).

I have a loadable kernel module which is a driver that manages a PCIe card. If I manually insert the driver using insmod while the OS is up and running, the driver loads successfully and is operational.

However, if I try to install the driver using rpm and then reboot the system, during startup the OS gets stuck spitting out the following "soft lockup" message for ALL the CPU cores, except for one core that is in "soft lockup" in one of the threads created by my driver.

BUG: soft lockup - CPU#X stuck for 67s! [migration/8:36]
.......(same above message for all cores except one)
BUG: soft lockup - CPU#10 stuck for 67s! [mydriver_thread/8:36]
(one core is locked up in one of the threads in my driver).

I searched the net quite a bit for info on this kernel message / bug, and there are quite a few posts about it, but none explain what causes it or how to debug it. Any help with the following questions would really be appreciated:

1. I am not able to log into the system (I think because all the cores are in a "soft lockup" state), and hence cannot trigger a kernel dump from a shell prompt. I enabled SysRq and tried to trigger a kernel dump with the SysRq key combo, but no luck. It seems the system is not responding to the keyboard (not even to the CapsLock key). Any suggestions on how I can trigger a kernel dump in this circumstance?
2. I can imagine the possibility of my driver thread causing a "soft lockup". But how can the "migration" thread (a kernel thread) be in a "soft lockup" just because of my driver?
3. From browsing the net, the "migration" thread is used to move tasks from one CPU to another. Can someone please help me understand what this thread exactly does, and how it can be affected by other threads, if at all?

It would be very helpful if you could show us some stack traces. — cdleonard, Feb 28 '13 at 22:02
Having the problem on reboot makes me think of the many many problems modules have had loading firmware when there is no firmware. Is the driver trying to load from the initial ramdisk? Is it demanding firmware and not getting it? Is your driver looping during initialization and hogging all of the work queue threads or something? — Zan Lynx, Feb 28 '13 at 22:19
@cdleonard There is no backtrace on the screen. All I am getting are sixteen lines of the same kernel message ("BUG: soft lockup .....") for each of the sixteen cores in the system. One of those messages is for a core busy with a thread from my driver, and the rest of the cores are stuck with the migration thread. — Ahmed A, Feb 28 '13 at 23:13
@Zan Lynx The driver is not loading from the init ramdisk. It does not do any firmware download, but just programs an Ethernet card. I don't believe the driver is hogging all the work queue threads. If so, would I not have run into the same issue when I performed an insmod? Out of curiosity, what is the maximum number of work queue threads a driver is allowed to create? — Ahmed A, Feb 28 '13 at 23:18
@ZanLynx There is a config option and boot parameter to panic on softlock: CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC. It will show a lot of useful debug info. softlockup usually means something like "infinite loop with BH disabled" but that's too vague without a stack trace. — cdleonard, Feb 28 '13 at 23:44
@cdleonard In my system, looking into the config file, I see the following two lines: "# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set" "CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0". Is there still a way I can enable the feature by using "sysctl" or a boot parameter, or do I have to recompile the kernel to enable it? — Ahmed A, Mar 01 '13 at 03:36
You should reconfigure/recompile your kernel. You can enable a lot of helpful stuff under "Kernel Hacking" — cdleonard, Mar 01 '13 at 06:47
You can get the kernel to panic (and give you a backtrace) without recompiling. I'm using CentOS 6.4 and simply adding softlockup_panic=1 to the kernel bootline enables this. — David, Sep 17 '13 at 13:50
Have you installed kdump (a tool) for rebooting the machine when your program causes a panic? — cwfighter, Jun 10 '15 at 08:11
Does your driver thread need to run on a certain CPU core to access the PCIe bus? And what is your thread waiting for? Some wait_for_completion call? What's meant to unlock it? — Hervé, Jan 30 '16 at 21:36
You said that you're installing the driver using rpm. Can you tell me if the rpm is running dracut to create an initramfs image? There may be an issue where you need to add extra things to your initramfs, or you can simply write a custom script that loads your driver at the right spot in your boot sequence. — Ahmed Masud, Jan 15 '17 at 05:14
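
As a follow-up to the softlockup_panic=1 suggestion in the comments above, here is a minimal shell sketch for confirming, after a reboot, that the parameter actually reached the kernel command line. The function name has_softlockup_panic is made up for illustration; /proc/cmdline is where the running kernel exposes its boot parameters.

```shell
# Return success if the kernel command line contains softlockup_panic=1.
# Reads /proc/cmdline by default; a file path can be passed instead
# (useful for checking a saved copy). The function name is hypothetical.
has_softlockup_panic() {
    case " $(cat "${1:-/proc/cmdline}") " in
        *" softlockup_panic=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Running `has_softlockup_panic && echo "panic on soft lockup is enabled"` on the booted system should then show whether the bootline change took effect.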

score 3 · Answer 1 · answered Jun 14 '16 at 12:36

I had a very similar problem on my desktop. It would soft lockup very frequently - about once a day or so.

It turns out it was because I was running on an Intel Haswell. It seems that the Haswell/Broadwell series of Intel processors have a bug which can cause system instability. This bug was fixed in a microcode update.

Check if CentOS offers an intel-microcode package, and install it. Make sure you configure grub to load it as the initial ramdisk before it loads initramfs.

Personally, I upgraded my microcode by booting into Windows and running a BIOS update. You can check whether the microcode was actually updated by comparing the output of grep 'microcode' /proc/cpuinfo before and after the update.
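
The before/after comparison could be scripted roughly as follows. This is only a sketch: the function name microcode_revisions is invented here, and it assumes your kernel reports a microcode field in /proc/cpuinfo (if it does not, the function prints nothing).

```shell
# Print the distinct microcode revisions found in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; a path to a saved snapshot can be passed.
# The function name is hypothetical.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}
```

For example, run `microcode_revisions > before.txt` prior to the update, `microcode_revisions > after.txt` once the system is back up, and then `diff before.txt after.txt`. If the files are identical, the update did not change the loaded revision.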