
I'm writing a Linux kernel module in which I have a loop that processes work, like below:

while (1) {
    while (there_is_work())
        process_work();
    if (should_stop)
        break;
    sleep();  /* wait to be woken up */
}

When there's lots of work, it results in a soft lockup. The message looks like this:

[ 1426.067061] BUG: soft lockup - CPU#3 stuck for 23s! [comp_wqa:2969]
[ 1426.067903] Modules linked in: testmodule(OE+) xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter hwmon_vid dm_mirror dm_region_hash dm_log dm_mod snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic intel_powerclamp coretemp intel_rapl kvm eeepc_wmi crc32_pclmul asus_wmi ghash_clmulni_intel sparse_keymap rfkill mxm_wmi aesni_intel wmi lrw snd_hda_intel gf128mul glue_helper snd_hda_codec pcspkr ablk_helper sg
[ 1426.067924]  cryptd shpchp snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm tpm_infineon acpi_pad snd_timer mei_me mei snd soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common crc32c_intel serio_raw i915 ahci libahci libata i2c_algo_bit drm_kms_helper drm e1000e ptp pps_core i2c_core video
[ 1426.067939] CPU: 3 PID: 2969 Comm: comp_wqa Tainted: G           OE  ------------   3.10.0-327.28.3.el7.x86_64 #1
[ 1426.067940] Hardware name: ASUS All Series/Z97-A, BIOS 2401 04/24/2015
[ 1426.067941] task: ffff88080f212280 ti: ffff880810a68000 task.ti: ffff880810a68000
[ 1426.067942] RIP: 0010:[<ffffffff8107e11f>]  [<ffffffff8107e11f>] vprintk_emit+0x1bf/0x530
[ 1426.067946] RSP: 0018:ffff880810a6bbc0  EFLAGS: 00000246
[ 1426.067947] RAX: 0000000000000001 RBX: 0000000000000003 RCX: 0000000000000000
[ 1426.067948] RDX: 0000000000000001 RSI: ffff88083fb8f6c8 RDI: 0000000000000246
[ 1426.067948] RBP: ffff880810a6bc20 R08: 0000000000000092 R09: 0000000000007d0d
[ 1426.067949] R10: 0000000000008000 R11: ffffc90023effff8 R12: 0000000000000081
[ 1426.067950] R13: ffffffff81a08020 R14: 000000009176cc6c R15: 0000000000000000
[ 1426.067951] FS:  0000000000000000(0000) GS:ffff88083fb80000(0000) knlGS:0000000000000000
[ 1426.067951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1426.067952] CR2: 00007f42411ff00e CR3: 000000000194a000 CR4: 00000000001407e0
[ 1426.067953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1426.067954] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1426.067954] Stack:
[ 1426.067955]  ffffffff81cae082 0000000000000071 0000000000000000 ffff880810a6bc40
[ 1426.067956]  ffffffffa07a45a0 000000008116c24e 0000000000000246 ffff8807dfbba800
[ 1426.067958]  ffff880810a70000 ffff8807dfbc5030 ffff8807dfbc4e00 ffff8807e65b3000
[ 1426.067959] Call Trace:

So after some googling, I changed the code to the following:

while (1) {
    while (there_is_work()) {
        process_work();
        cond_resched();
    }
    if (should_stop)
        break;
    sleep();  /* wait to be woken up */
}
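
Concretely, the loop looks roughly like this; there_is_work() and process_work() stand in for my module's own helpers, and the sleep is a wait queue:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(work_wq);

static int worker_fn(void *data)
{
    while (1) {
        while (there_is_work()) {
            process_work();
            cond_resched();  /* offer to give up the CPU */
        }
        if (kthread_should_stop())
            break;
        /* sleep until new work arrives or we are asked to stop */
        wait_event_interruptible(work_wq,
                there_is_work() || kthread_should_stop());
    }
    return 0;
}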

And with this code, the soft lockups happen less often. But they still happen under heavier load. I thought that if this thread had been occupying the CPU for a long time, cond_resched() would give up the CPU. I guess I was wrong.

I want to know how the soft lockups should be avoided without the thread being idle too much (I want the module to process lots of work without long latency).

After thinking more about this, I realize that what I want is just to make a CPU core run a dedicated thread without being interrupted. It seems the kernel doesn't support this directly. There is a kernel parameter called watchdog_thresh which decides how many seconds a thread can run continuously. I have read other posts suggesting that this kind of soft lockup is harmless. And I now understand more deeply that the performance of my driver is heavily dependent on single-core CPU performance, since I have to process the work with a single thread.
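
For what it's worth, pinning the thread to one core is itself supported via kthread_bind(); what the kernel doesn't offer is letting that thread monopolize the core without the watchdog complaining. A rough sketch (worker_fn is the loop above; CPU 3 is an arbitrary choice):

#include <linux/err.h>
#include <linux/kthread.h>

static struct task_struct *worker;

static int start_worker(void)
{
    /* create the thread stopped, bind it to a core, then start it */
    worker = kthread_create(worker_fn, NULL, "comp_wqa");
    if (IS_ERR(worker))
        return PTR_ERR(worker);
    kthread_bind(worker, 3);  /* pin to CPU 3 */
    wake_up_process(worker);
    return 0;
}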

  • @sawdust Maybe I failed to make myself clear. In the kernel, if a thread runs continuously for a long time (20 seconds, 60 seconds, etc., depending on configuration), the kernel will print messages saying that this thread has been stuck for 23 seconds and so on. – coderfive Mar 29 '17 at 08:57
  • What happens if you just call `schedule()` instead of `cond_resched()`? Or does that reduce the work throughput too much? – Ian Abbott Mar 29 '17 at 16:02
  • Note that `cond_resched()` is a no-op if `CONFIG_PREEMPT` is defined. – Ian Abbott Mar 29 '17 at 16:08
  • @IanAbbott `schedule()` would put the thread to sleep, and there needs to be some other thread to wake it up, or else it would sleep indefinitely. – coderfive Mar 29 '17 at 16:10
  • @IanAbbott I don't think `cond_resched()` is a no-op when `CONFIG_PREEMPT` is defined. Can you give more info about this? – coderfive Mar 29 '17 at 16:14
  • `schedule()` won't put the thread to sleep unless you set `current->state` to something other than `TASK_RUNNING`. If it's in the "running" state, it may get rescheduled immediately, or other tasks might get a turn, but it doesn't need another thread to wake it up from this state. – Ian Abbott Mar 29 '17 at 16:17
  • OK, `cond_resched()` didn't become a no-op for `CONFIG_PREEMPT` until the 4.10 kernel, although the comment for the git commit that changed it says the call to the underlying `_cond_resched()` is pointless for `CONFIG_PREEMPT`. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=35a773a07926a22bf19d77ee00024522279c4e68 – Ian Abbott Mar 29 '17 at 16:30
  • @IanAbbott Thanks a lot. I have a broken mental model of these concepts. – coderfive Mar 29 '17 at 16:35

2 Answers


While a kernel thread may hold a CPU contrary to what the scheduling algorithm wants, the thread gives up the CPU (e.g. via cond_resched()) only if the scheduling algorithm decides it should. See also this question, which explains similar things.

You need to adjust your kernel thread so that it can be preempted by others. Some possible solutions:

  • Lower the thread's priority or change its scheduling policy. Use the sched_setscheduler function for that.

  • After processing several (say, 10) works in a batch, pause the thread for a short period of time. (See the sketch below.)
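
A rough sketch of both ideas combined. Note this uses set_user_nice() for the priority part; the helper names, batch size, and sleep length are placeholders you would tune:

#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static int worker_fn(void *data)
{
    /* 1. Lower the thread's priority: keep the normal policy but give
     *    it a positive nice value so other runnable tasks are favored. */
    set_user_nice(current, 10);

    while (!kthread_should_stop()) {
        int batch = 0;

        while (there_is_work()) {   /* placeholder for your own check */
            process_work();         /* placeholder for your own work  */

            /* 2. After a bunch of works, pause briefly so the watchdog
             *    and other tasks can run on this CPU. */
            if (++batch >= 10) {
                batch = 0;
                usleep_range(50, 100);
            }
        }
        /* ... the original sleep/wake-up logic goes here ... */
    }
    return 0;
}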

Tsyvarev
  • Thanks for your answer. Like I said in the problem details, I want to utilize the CPU to the fullest. The thing is, if I just wanted it to work, I could use the workqueue provided by the kernel. I thought that if I implemented the loop logic myself, I might get a better result, meaning more work gets done. – coderfive Mar 29 '17 at 09:45
  • There is no universal approach to `I want to utilize the cpu to the fullest.` that is applicable to every CPU workload. The main question to ask yourself is: "Assuming some *other* thread becomes ready while I process the works, **in which cases** do I want to *give the CPU* to that thread?" The answer to that question is what you implement via *scheduling parameters*. – Tsyvarev Mar 29 '17 at 10:09

There is a fundamental difference between writing user-space code and kernel code. Bad user code (read: bad with respect to scheduling) is often handled, even corrected, by the kernel, whereas bad kernel code is deadly; there are many ways you can kill the entire machine when writing kernel code. So the answer to your question is that you absolutely must design this on paper first. Specifically, unlike when writing user code, you must think about scheduling and design something that will work for you. There are two basic questions to ask, and of course to answer:

  1. When does my task run?
  2. When do all of the other tasks on the system run?

Note that you have not answered the second one. Once you have the answers (they will define how the CPU migrates from one task to another), you can start thinking about how to implement that.

Make sure you understand the current behavior; it will really help you with the relevant concepts. The soft lockup means your task is 1) holding the CPU and 2) not allowing the kernel to preempt it (for a long time). Find out why this task can't be preempted (let's hope it's not holding a spinlock).

You mention wanting to avoid "being idle"; I am not sure if you mean "your task not running" or "the CPU being idle". Those are two very different cases. Your task must let all the other tasks in the system run (as hinted above), so it will very definitely not run while the other tasks run; but you don't necessarily ever have to have an idle CPU if you have lots of work. If your goal is avoiding the latter, you are right that an idle CPU is often the result of a poor design (throwing in msleep() instead of taking the time to work out the appropriate scheduling algorithm/parameters).

As I said, the first step, before you write a single line of code, is to be able to describe how and when your task will run.

kozel
  • Thanks for your reply. Nowadays, lots of machines have multiple cores. In my case, I don't really need to care about other threads; they can get the CPU time they want. Let me put it this way: I actually want to dedicate a CPU core to just running this thread, and I don't want to share that core with other threads. The problem is that I have to process the work sequentially. – coderfive Mar 30 '17 at 07:52
  • @coderfive Sounds like you already realize that the soft lockup is per CPU, so you can't just take the CPU over. But you can lower the priority of your task and let the soft lockup watchdog run when it needs to run, which will in turn eliminate the lockup. Hence my observation that you do need to care about other threads. – kozel Mar 30 '17 at 13:18