blocked for more than 120 seconds

Question

I am trying to write a block device driver that reads and writes blocks over a network socket. At some point, when reading multiple blocks, the application that uses this driver seems to hang (it still accepts input but does nothing with it), while the system in general stays responsive. dmesg shows the message below. After that, I cannot use the driver for anything, even if I start another application that uses it.

I am using Linux kernel v3.9.

Can anyone help fix this?

[  489.779458] INFO: task xxd:2939 blocked for more than 120 seconds.
[  489.779466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.779469] xxd             D 0000000000000000     0  2939   2237 0x00000006
[  489.779475]  ffff8801912a9998 0000000000000046 02fc000000000008 ffff8801bfff7000
[  489.779479]  ffff8801b2ef45f0 ffff8801912a9fd8 ffff8801912a9fd8 ffff8801912a9fd8
[  489.779482]  ffff8801b61e9750 ffff8801b2ef45f0 ffff8801912a9998 ffff8801b8e34af8
[  489.779485] Call Trace:
[  489.779497]  [<ffffffff81131ad0>] ? __lock_page+0x70/0x70
[  489.779505]  [<ffffffff816e86a9>] schedule+0x29/0x70
[  489.779510]  [<ffffffff816e877f>] io_schedule+0x8f/0xd0
[  489.779514]  [<ffffffff81131ade>] sleep_on_page+0xe/0x20
[  489.779518]  [<ffffffff816e654a>] __wait_on_bit_lock+0x5a/0xc0
[  489.779522]  [<ffffffff811348aa>] ? find_get_pages+0xca/0x150
[  489.779526]  [<ffffffff81131ac7>] __lock_page+0x67/0x70
[  489.779531]  [<ffffffff8107fa50>] ? autoremove_wake_function+0x40/0x40
[  489.779536]  [<ffffffff81140bd2>] truncate_inode_pages_range+0x4b2/0x4c0
[  489.779540]  [<ffffffff81140c65>] truncate_inode_pages+0x15/0x20
[  489.779545]  [<ffffffff811d331c>] kill_bdev+0x2c/0x40
[  489.779548]  [<ffffffff811d3931>] __blkdev_put+0x71/0x1c0
[  489.779552]  [<ffffffff811aeb48>] ? __d_free+0x48/0x70
[  489.779556]  [<ffffffff811d3adb>] blkdev_put+0x5b/0x160
[  489.779559]  [<ffffffff811d3c05>] blkdev_close+0x25/0x30
[  489.779564]  [<ffffffff8119b16a>] __fput+0xba/0x240
[  489.779568]  [<ffffffff8119b2fe>] ____fput+0xe/0x10
[  489.779572]  [<ffffffff8107ba18>] task_work_run+0xc8/0xf0
[  489.779577]  [<ffffffff8105f797>] do_exit+0x2c7/0xa70
[  489.779581]  [<ffffffff8106f32e>] ? send_sig_info+0x1e/0x20
[  489.779585]  [<ffffffff8106f34c>] ? send_sig+0x1c/0x20
[  489.779588]  [<ffffffff8105ffd4>] do_group_exit+0x44/0xa0
[  489.779592]  [<ffffffff8106fe00>] get_signal_to_deliver+0x230/0x600
[  489.779600]  [<ffffffff81014398>] do_signal+0x58/0x8e0
[  489.779605]  [<ffffffff81014ca0>] do_notify_resume+0x80/0xc0
[  489.779608]  [<ffffffff816f241a>] int_signal+0x12/0x17

You should definitely expect that as a possibility when dealing with a network, though it's an open question if the occurrence you are seeing is happening for legitimate reasons, or due to an implementation/configuration mistake. — Chris Stratton, Oct 10 '13 at 17:34
The driver successfully reads a number of blocks (I did not try to figure out how many). But it stops at one point. I am sure it is something bad I did. When it tries to read only one block, it does that successfully. It hangs only when it tries to read several (121) sequential blocks in the device. — feeling_lonely, Oct 10 '13 at 17:56
From the looks of it, I am doing something wrong with the locks, but I am not sure what it is. — feeling_lonely, Oct 10 '13 at 17:59
Can you break the problem down by faking part of it, such as the network, and see if the issue exists with only the block device end? — Chris Stratton, Oct 10 '13 at 18:03
Well, yes. I essentially modified the ram disk driver here: < http://www.linuxforu.com/2012/02/device-drivers-disk-on-ram-block-drivers/ >. The ram disk works, however, when I added the network part to it, I ran into many problems. I will send the link to my code in another comment. — feeling_lonely, Oct 10 '13 at 18:15
here it is: https://bitbucket.org/hebbo_pub/networkblockdevice — feeling_lonely, Oct 10 '13 at 18:20
The hangs seem to happen when a block number that is a multiple of 16 is being read. — feeling_lonely, Oct 10 '13 at 20:24
I was reading through the Linux sources again and found this: http://lxr.free-electrons.com/source/block/blk-core.c?v=3.9#L1556 I am almost sure that this is the part that is giving me the headaches. Now, the question is: I think I am removing requests from the queue as they are being served, so why does it still accumulate more than 16 requests? — feeling_lonely, Oct 10 '13 at 21:20
My main goal is to have the process run fine with no hangs, so disabling the watchdog does not help. I would like to fix the code, but I am not sure how. Why is the request queue accumulating more than 16 requests even though I am serving requests and dequeuing them? — feeling_lonely, Oct 11 '13 at 14:44

score 0 · Accepted Answer · answered Oct 16 '13 at 17:47

I had the synchronization around the socket done wrong. This caused race conditions that left some requests unserved, and those unserved requests caused the process to hang.

Adding some mutexes (not semaphores) fixed this.
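
To illustrate the kind of fix described above, here is a minimal userspace sketch, not the driver's actual code. It assumes a simple request/reply protocol over a shared stream socket, with a local `socketpair` standing in for the network. Every name in it (`fetch_block`, `sock_lock`, `run_demo`, the echo server) is hypothetical. The point is that one mutex covers both the send and the matching receive, so concurrent callers cannot interleave their bytes or pick up each other's replies.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define WORKERS 4
#define BLOCKS_PER_WORKER 121
#define STOP_REQUEST UINT32_MAX

static int client_fd, server_fd;
/* Hypothetical lock guarding the one socket shared by all workers. */
static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read exactly len bytes; 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes; 0 on success, -1 on error. */
static int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Stand-in for the remote end: echoes each block number back. */
static void *server_main(void *arg)
{
    uint32_t req;
    (void)arg;
    while (read_full(server_fd, &req, sizeof req) == 0 && req != STOP_REQUEST)
        if (write_full(server_fd, &req, sizeof req) != 0)
            break;
    return NULL;
}

/* One transaction: the lock is held across send AND receive. */
static int fetch_block(uint32_t block, uint32_t *reply)
{
    int rc;
    pthread_mutex_lock(&sock_lock);
    rc = write_full(client_fd, &block, sizeof block);
    if (rc == 0)
        rc = read_full(client_fd, reply, sizeof *reply);
    pthread_mutex_unlock(&sock_lock);
    return rc;
}

/* Each worker requests its own range of blocks and counts bad replies. */
static void *worker_main(void *arg)
{
    uint32_t base = (uint32_t)(uintptr_t)arg * 1000u;
    intptr_t bad = 0;
    for (uint32_t i = 0; i < BLOCKS_PER_WORKER; i++) {
        uint32_t reply = 0;
        if (fetch_block(base + i, &reply) != 0 || reply != base + i)
            bad++;
    }
    return (void *)bad;
}

/* Returns the number of failed or mismatched transactions, or -1. */
int run_demo(void)
{
    int fds[2], rc, bad = 0;
    pthread_t server, workers[WORKERS];
    uint32_t stop = STOP_REQUEST;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;
    client_fd = fds[0];
    server_fd = fds[1];
    if (pthread_create(&server, NULL, server_main, NULL) != 0)
        return -1;
    for (uintptr_t i = 0; i < WORKERS; i++) {
        rc = pthread_create(&workers[i], NULL, worker_main, (void *)i);
        assert(rc == 0);
    }
    for (int i = 0; i < WORKERS; i++) {
        void *res;
        pthread_join(workers[i], &res);
        bad += (int)(intptr_t)res;
    }
    write_full(client_fd, &stop, sizeof stop); /* tell the server to exit */
    pthread_join(server, NULL);
    close(fds[0]);
    close(fds[1]);
    return bad;
}
```

With the lock in place, `run_demo()` should report zero mismatches. In a kernel driver the analogous primitive would be a `struct mutex` used with `mutex_lock()` and `mutex_unlock()`. Because a mutex can sleep, it can only be taken from a context that is allowed to sleep, which may mean moving the socket I/O out of the request function and into a worker thread.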

feeling_lonely
