I am trying to write a block device driver that reads and writes blocks over a network socket. At some point, when reading multiple blocks, the application that uses this driver seems to hang (it still accepts input but does nothing with it), while the system in general stays responsive. dmesg shows the message below. After that I cannot use the driver for anything, even if I start another application that uses it.

I am using Linux kernel v3.9.

Can anyone help me fix this?

[  489.779458] INFO: task xxd:2939 blocked for more than 120 seconds.
[  489.779466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  489.779469] xxd             D 0000000000000000     0  2939   2237 0x00000006
[  489.779475]  ffff8801912a9998 0000000000000046 02fc000000000008 ffff8801bfff7000
[  489.779479]  ffff8801b2ef45f0 ffff8801912a9fd8 ffff8801912a9fd8 ffff8801912a9fd8
[  489.779482]  ffff8801b61e9750 ffff8801b2ef45f0 ffff8801912a9998 ffff8801b8e34af8
[  489.779485] Call Trace:
[  489.779497]  [<ffffffff81131ad0>] ? __lock_page+0x70/0x70
[  489.779505]  [<ffffffff816e86a9>] schedule+0x29/0x70
[  489.779510]  [<ffffffff816e877f>] io_schedule+0x8f/0xd0
[  489.779514]  [<ffffffff81131ade>] sleep_on_page+0xe/0x20
[  489.779518]  [<ffffffff816e654a>] __wait_on_bit_lock+0x5a/0xc0
[  489.779522]  [<ffffffff811348aa>] ? find_get_pages+0xca/0x150
[  489.779526]  [<ffffffff81131ac7>] __lock_page+0x67/0x70
[  489.779531]  [<ffffffff8107fa50>] ? autoremove_wake_function+0x40/0x40
[  489.779536]  [<ffffffff81140bd2>] truncate_inode_pages_range+0x4b2/0x4c0
[  489.779540]  [<ffffffff81140c65>] truncate_inode_pages+0x15/0x20
[  489.779545]  [<ffffffff811d331c>] kill_bdev+0x2c/0x40
[  489.779548]  [<ffffffff811d3931>] __blkdev_put+0x71/0x1c0
[  489.779552]  [<ffffffff811aeb48>] ? __d_free+0x48/0x70
[  489.779556]  [<ffffffff811d3adb>] blkdev_put+0x5b/0x160
[  489.779559]  [<ffffffff811d3c05>] blkdev_close+0x25/0x30
[  489.779564]  [<ffffffff8119b16a>] __fput+0xba/0x240
[  489.779568]  [<ffffffff8119b2fe>] ____fput+0xe/0x10
[  489.779572]  [<ffffffff8107ba18>] task_work_run+0xc8/0xf0
[  489.779577]  [<ffffffff8105f797>] do_exit+0x2c7/0xa70
[  489.779581]  [<ffffffff8106f32e>] ? send_sig_info+0x1e/0x20
[  489.779585]  [<ffffffff8106f34c>] ? send_sig+0x1c/0x20
[  489.779588]  [<ffffffff8105ffd4>] do_group_exit+0x44/0xa0
[  489.779592]  [<ffffffff8106fe00>] get_signal_to_deliver+0x230/0x600
[  489.779600]  [<ffffffff81014398>] do_signal+0x58/0x8e0
[  489.779605]  [<ffffffff81014ca0>] do_notify_resume+0x80/0xc0
[  489.779608]  [<ffffffff816f241a>] int_signal+0x12/0x17
feeling_lonely
  • You should definitely expect that as a possibility when dealing with a network, though it's an open question if the occurrence you are seeing is happening for legitimate reasons, or due to an implementation/configuration mistake. – Chris Stratton Oct 10 '13 at 17:34
  • The driver successfully reads a number of blocks (I did not try to figure out how many), but it stops at some point. I am sure it is something I did wrong. When it tries to read only one block, it succeeds. It hangs only when it tries to read several (121) sequential blocks on the device. – feeling_lonely Oct 10 '13 at 17:56
  • From the looks of it, I am doing something wrong with the locks, but I am not sure what it is. – feeling_lonely Oct 10 '13 at 17:59
  • Can you break the problem down by faking part of it, such as the network, and see if the issue exists with only the block device end? – Chris Stratton Oct 10 '13 at 18:03
  • Well, yes. I essentially modified the RAM disk driver here: <http://www.linuxforu.com/2012/02/device-drivers-disk-on-ram-block-drivers/>. The RAM disk works; however, when I added the network part to it, I ran into many problems. I will send the link to my code in another comment. – feeling_lonely Oct 10 '13 at 18:15
  • Here it is: https://bitbucket.org/hebbo_pub/networkblockdevice – feeling_lonely Oct 10 '13 at 18:20
  • The hangs seem to happen when a block number that is a multiple of 16 is being read. – feeling_lonely Oct 10 '13 at 20:24
  • I was reading through the Linux sources again and found this: http://lxr.free-electrons.com/source/block/blk-core.c?v=3.9#L1556. I am almost sure that this is the part that is giving me the headaches. Now the question is: I think I am removing requests from the queue as they are being served, so why does it still accumulate more than 16 requests? – feeling_lonely Oct 10 '13 at 21:20
  • Use `nowatchdog` kernel boot option or rewrite your code. – Ilya Matveychikov Oct 11 '13 at 12:07
  • My main goal is to have the process run with no hangs, so disabling the watchdog does not help. I would like to fix the code, but I am not sure how. Why is the request queue accumulating more than 16 requests even though I am serving requests and dequeuing them? – feeling_lonely Oct 11 '13 at 14:44

1 Answer

I had the synchronization around the socket wrong. The resulting race conditions left some requests unserved, and those unserved requests caused the process to hang.

Adding some mutexes (not semaphores) fixed this.
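
To illustrate the idea, here is a minimal userspace sketch, not the driver code itself. It assumes a simple request/response protocol over one shared socket: send a block number, get a fixed-size block back. All names (`read_block`, `run_demo`, `sock_lock`, `BLOCK_SIZE`) are made up for the example, and it uses pthreads and `socketpair` in place of the kernel's `struct mutex` and kernel sockets. The point is only that the mutex is held across the whole send-then-receive exchange, so two callers cannot interleave their traffic on the socket.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE  16   /* hypothetical block size for the demo */
#define MAX_WORKERS 16

static int client_fd, server_fd;
static pthread_mutex_t sock_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read exactly len bytes; 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t len) {
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n <= 0) return -1;
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Write exactly len bytes; 0 on success, -1 on error. */
static int write_full(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return -1;
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Fake remote end: answers each block number with a block filled with its low byte. */
static void *server_main(void *arg) {
    uint32_t blk;
    unsigned char data[BLOCK_SIZE];
    (void)arg;
    while (read_full(server_fd, &blk, sizeof blk) == 0) {
        memset(data, (int)(blk & 0xff), sizeof data);
        if (write_full(server_fd, data, sizeof data) != 0) break;
    }
    return NULL;
}

/* One request/response exchange; the lock covers both halves. */
static int read_block(uint32_t blk, unsigned char *out) {
    int rc;
    pthread_mutex_lock(&sock_lock);
    rc = write_full(client_fd, &blk, sizeof blk);
    if (rc == 0) rc = read_full(client_fd, out, BLOCK_SIZE);
    pthread_mutex_unlock(&sock_lock);
    return rc;
}

struct worker { uint32_t first; int count; int errors; };

/* Each worker reads its own range of blocks and checks the contents. */
static void *worker_main(void *arg) {
    struct worker *w = arg;
    unsigned char buf[BLOCK_SIZE];
    for (int i = 0; i < w->count; i++) {
        uint32_t blk = w->first + (uint32_t)i;
        if (read_block(blk, buf) != 0) { w->errors++; continue; }
        for (int j = 0; j < BLOCK_SIZE; j++)
            if (buf[j] != (unsigned char)(blk & 0xff)) { w->errors++; break; }
    }
    return NULL;
}

/* Returns the number of failed or mismatched reads, or -1 on setup failure. */
int run_demo(int nthreads, int per_thread) {
    int sv[2], total = 0, started = 0;
    pthread_t server, tids[MAX_WORKERS];
    struct worker ws[MAX_WORKERS];
    if (nthreads < 1 || nthreads > MAX_WORKERS) return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    client_fd = sv[0]; server_fd = sv[1];
    if (pthread_create(&server, NULL, server_main, NULL) != 0) return -1;
    for (int i = 0; i < nthreads; i++) {
        ws[i] = (struct worker){ (uint32_t)(i * per_thread), per_thread, 0 };
        if (pthread_create(&tids[i], NULL, worker_main, &ws[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        total += ws[i].errors;
    }
    if (started != nthreads) total++;  /* count a thread that failed to start */
    close(client_fd);                  /* server sees EOF and exits */
    pthread_join(server, NULL);
    close(server_fd);
    return total;
}
```

Calling `run_demo(4, 50)` from a `main` (built with `-pthread`) should return 0, since every reader gets the reply to its own request. If the lock in `read_block` is removed, replies can end up with the wrong reader, which is the kind of race I had. In the driver itself the equivalent would be a kernel mutex held around the socket send and receive for each request.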

feeling_lonely