
Apparently there is a known problem of XFS locking up the kernel/processes and corrupting volumes under heavy traffic. Some web pages discuss it, but I was not able to figure out which of them are recent and might actually have a solution.

My company's deployments run Debian with kernel 3.4.107, xfsprogs 3.1.4, and large storage arrays. We handle large amounts of data (PB) at high throughput (GB/sec) using async I/O to several large volumes, and we constantly experience these unpredictable lockups on several systems. Kernel logs/dmesg show something like the following:

2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986515] INFO: task Sr2dReceiver-5:46829 blocked for more than 120 seconds.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986518] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986520] Sr2dReceiver-5  D ffffffff8105b39e     0 46829   7284 0x00000000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986524]  ffff881e71f57b38 0000000000000082 000000000000000b ffff884066763180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986529]  0000000000000000 ffff884066763180 0000000000011180 0000000000011180
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986532]  ffff881e71f57fd8 ffff881e71f56000 0000000000011180 ffff881e71f56000
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986536] Call Trace:
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986545]  [<ffffffff814ffe9f>] schedule+0x64/0x66
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986548]  [<ffffffff815005f3>] rwsem_down_failed_common+0xdb/0x10d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986551]  [<ffffffff81500638>] rwsem_down_write_failed+0x13/0x15
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986555]  [<ffffffff8126b583>] call_rwsem_down_write_failed+0x13/0x20
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986558]  [<ffffffff814ff320>] ? down_write+0x25/0x27
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986572]  [<ffffffffa01f29e0>] xfs_ilock+0xbc/0x12e [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986580]  [<ffffffffa01eec71>] xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986586]  [<ffffffffa01eec71>] ? xfs_rw_ilock+0x2c/0x33 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986593]  [<ffffffffa01ef234>] xfs_file_aio_write_checks+0x41/0xfe [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986600]  [<ffffffffa01ef358>] xfs_file_buffered_aio_write+0x67/0x179 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986603]  [<ffffffff8150099a>] ? _raw_spin_unlock_irqrestore+0x30/0x3d
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986611]  [<ffffffffa01ef81d>] xfs_file_aio_write+0x163/0x1b5 [xfs]
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986614]  [<ffffffff8106f1af>] ? futex_wait+0x22c/0x244
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986619]  [<ffffffff8110038e>] do_sync_write+0xd9/0x116
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986622]  [<ffffffff8150095f>] ? _raw_spin_unlock+0x26/0x31
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986634]  [<ffffffff8106f2f1>] ? futex_wake+0xe8/0xfa
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986637]  [<ffffffff81100d1d>] vfs_write+0xae/0x10a
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986639]  [<ffffffff811015b3>] ? fget_light+0xb0/0xbf
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986642]  [<ffffffff81100dd3>] sys_pwrite64+0x5a/0x79
2016 Mar 24 04:42:34 hmtmzhbgb01-ssu-1 kernel: [2358750.986645]  [<ffffffff81506912>] system_call_fastpath+0x16/0x1b

Lockups leave the system in a bad state. The hung processes sit in D state and cannot even be killed with signal 9. The only way to resume operations is to reboot and repair the XFS volumes, after which the system works for a while longer. But occasionally after a lockup we cannot even repair some volumes: they are totally corrupted and we have to rebuild them with mkfs.
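For diagnosis, the blocked tasks and their kernel stacks can at least be captured with standard procfs/sysrq interfaces before rebooting. A rough sketch (the PID is just the one from the trace above; /proc/<pid>/stack and sysrq-w depend on the kernel configuration):

# List tasks stuck in uninterruptible sleep (D state)
ps -eo state,pid,ppid,wchan:32,cmd | awk '$1 == "D"'

# Dump the kernel stack of one blocked task (PID taken from the trace above)
cat /proc/46829/stack

# Or ask the kernel to log all blocked tasks to dmesg (needs sysrq enabled)
echo w > /proc/sysrq-trigger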

As a last resort, we now run xfs_repair periodically, which has reduced the frequency of lockups and data loss to some extent. But the incidents still happen often enough that we need a real solution; a rough sketch of such an offline repair pass is shown below.
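A minimal sketch of an offline check/repair pass, with placeholder device and mount-point names (xfs_repair must only be run against an unmounted volume):

# /dev/sdX1 and /data are placeholders for the real device and mount point
umount /data || exit 1
xfs_repair -n /dev/sdX1      # dry run: report problems without modifying the volume
xfs_repair /dev/sdX1         # actual repair
mount /dev/sdX1 /data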

I was wondering if there is a solution for this on kernel 3.4.107, e.g. a patch we could apply. Due to the large number of deployments and other software issues, we cannot upgrade the kernel in the near future.

However, we are working towards updating our applications so that we can run on kernel 3.16 in our next releases. Does anyone know if this XFS lockup problem was fixed in 3.16?

tktuserA
  • The thing is that with one or two systems in the lab we cannot even reproduce the problem, but with thousands in the field it keeps happening. – tktuserA Apr 08 '16 at 17:01

2 Answers


Some people have experienced this, but it was not a problem with XFS itself; it happened because the kernel was unable to flush dirty pages within the 120-second window. Have a look here, but please check the default values they mention against your own system.

http://blog.ronnyegner-consulting.de/2011/10/13/info-task-blocked-for-more-than-120-seconds/

and here

http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

You can see what your dirty cache ratios are by running this:

sysctl -a | grep dirty

or

cat /proc/sys/vm/dirty_ratio

The best write-up on this that I could find is here:

https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Essentially you need to tune your application (or the VM writeback settings) so that the dirty buffers can be written to disk within the time period, or increase the timeout itself.
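As an illustration only (the right values depend entirely on your RAM size and array speed), lowering the dirty thresholds so writeback starts earlier might look like this:

# Illustrative values only: start background writeback earlier and cap dirty memory lower
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Or make it persistent in /etc/sysctl.conf:
# vm.dirty_background_ratio = 5
# vm.dirty_ratio = 10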

You can also see some interesting parameters as follows:

sysctl -a | grep hung

You could increase the timeout permanently in /etc/sysctl.conf as follows:

kernel.hung_task_timeout_secs = 300
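To apply it without a reboot (note that this only changes when the warning fires; it does not fix the underlying stall):

# Apply immediately, then reload /etc/sysctl.conf to pick up the persistent setting
sysctl -w kernel.hung_task_timeout_secs=300
sysctl -p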
Harry
  • Thank you for the answer. We had actually looked into this and here are the parameters we use: vm.dirty_background_bytes = 0, vm.dirty_background_ratio = 5, vm.dirty_bytes = 0, vm.dirty_expire_centisecs = 3000, vm.dirty_ratio = 20, vm.dirty_writeback_centisecs = 500. The problem still occurs, and we also see it when the array throughput is far below the maximum we normally sustain. Here is another link that discusses the problem: http://oss.sgi.com/archives/xfs/2014-08/msg00377.html. I could not figure out whether there is a solution. Would you have any other suggestions? – tktuserA Apr 13 '16 at 17:56
  • There are some hints here: http://oss.sgi.com/archives/xfs/2014-08/msg00377.html. I think you're going to have to get core dumps and find out what xfs_repair is saying, i.e. it might hint at what went wrong. It might also be a good idea to get onto the xfs mailing list. – Harry Apr 13 '16 at 19:54

Does anyone know if this XFS lockup problem was fixed in 3.16?

According to A Short Guide to Kernel Debugging, it was:

Searching for “xfs splice deadlock” turns up an email thread from 2011 that describes this problem. However, bisecting the kernel source repository shows that the bug wasn’t really addressed until April, 2014 (8d02076) for release in Linux 3.16.
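If you want to verify which release first shipped that commit, a quick check against a clone of the mainline kernel tree (assuming the abbreviated id 8d02076 resolves uniquely in your clone) would be:

# In a clone of the mainline kernel tree: show the first tag that contains the fix
git describe --contains 8d02076
# or list all release tags that include it
git tag --contains 8d02076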

Armali