When writing sequentially to files on my large-ish (70 TiB) btrfs filesystem, I am noticing that something is causing prolonged write blocks (seconds to several minutes). This is inconvenient at best and is causing application timeouts in some cases.
My understanding is that btrfs commits should happen in the background and not block my writes, and most of the time this is indeed what I see when writing sequentially into an existing file, i.e. without new blocks being allocated. As soon as I am writing at high data rates into a file that is being extended, things get messy quickly.
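Background flushing also depends on the generic writeback thresholds, so for reference they can be inspected like this (just the command as a sketch, my values omitted):
$ sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs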
My test case is a simple
dd if=/dev/zero of=/mnt/btrfs/subvolume/testfile
which starts out writing at 150 MB/s, mainly dirtying blocks in the host's page cache. The file system has some background activity here which causes the reads seen in the statistics collected for this device with iostat -xdmt 1:
Time: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
01:09:59 PM 0.00 0.00 3761.00 0.00 14.69 0.00 8.00 0.38 0.10 0.10 38.00
01:10:00 PM 0.00 0.00 2868.00 0.00 11.20 0.00 8.00 0.42 0.15 0.15 41.60
01:10:01 PM 0.00 0.00 4002.00 0.00 15.63 0.00 8.00 0.43 0.11 0.11 42.80
01:10:02 PM 0.00 0.00 3331.00 0.00 13.01 0.00 8.00 0.54 0.16 0.16 54.40
01:10:03 PM 0.00 0.00 2674.00 0.00 10.45 0.00 8.00 0.62 0.23 0.23 62.00
01:10:04 PM 0.00 0.00 2771.00 0.00 10.82 0.00 8.00 0.59 0.21 0.21 58.80
01:10:05 PM 0.00 12.00 3004.00 2744.00 11.73 264.09 98.28 5.80 1.01 0.09 50.80
01:10:06 PM 0.00 24.00 2661.00 1970.00 10.39 92.86 45.66 75.36 16.12 0.16 74.80
01:10:07 PM 0.00 0.00 1556.00 628.00 6.08 171.23 166.27 9.84 4.83 0.33 72.00
01:10:08 PM 0.00 0.00 1298.00 0.00 5.07 0.00 8.00 0.41 0.31 0.27 35.20
So far all looks well, and the commit at 01:10:05 caused ~500 MB of data to be flushed to disk. This, however, is what happened subsequently:
Time: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
01:10:09 PM 0.00 0.00 1459.00 115.00 5.70 28.75 44.82 0.62 0.38 0.36 56.80
01:10:10 PM 0.00 8.00 268.00 6816.00 1.05 567.26 164.30 101.84 12.83 0.14 96.80
01:10:11 PM 0.00 4.00 4.00 5396.00 0.02 857.71 325.30 155.78 30.02 0.19 100.40
01:10:12 PM 0.00 16.00 10.00 3179.00 0.04 489.67 314.50 162.47 43.53 0.31 100.00
01:10:13 PM 0.00 13.00 6.00 2785.00 0.02 412.01 302.35 148.54 45.91 0.36 100.00
01:10:14 PM 0.00 8.00 0.00 2745.00 0.00 431.29 321.77 141.90 47.04 0.36 100.00
01:10:15 PM 0.00 9.00 16.00 4235.00 0.06 660.77 318.37 144.02 47.59 0.24 100.00
01:10:16 PM 0.00 8.00 11.00 2771.00 0.04 455.26 335.18 145.08 45.15 0.36 100.00
01:10:17 PM 0.00 18.00 17.00 2688.00 0.07 416.14 315.12 149.61 52.68 0.37 100.00
01:10:18 PM 0.00 15.00 32.00 4133.00 0.12 641.32 315.41 154.58 42.92 0.24 100.00
01:10:19 PM 0.00 3.00 23.00 1921.00 0.09 306.26 322.74 141.80 47.59 0.51 100.00
01:10:20 PM 0.00 10.00 13.00 3553.00 0.05 546.22 313.73 144.95 51.66 0.28 100.00
01:10:21 PM 0.00 6.00 24.00 4623.00 0.09 688.63 303.53 154.29 35.90 0.22 100.00
01:10:22 PM 0.00 8.00 61.00 10151.00 0.24 176.37 35.42 136.80 13.06 0.10 100.00
01:10:23 PM 0.00 0.00 22.00 9858.00 0.09 43.74 9.09 142.66 14.88 0.10 100.00
01:10:24 PM 0.00 0.00 4.00 14740.00 0.02 68.51 9.52 137.17 9.33 0.07 100.00
01:10:25 PM 0.00 0.00 4.00 9070.00 0.02 39.50 8.92 143.02 14.69 0.11 100.00
01:10:26 PM 0.00 0.00 6.00 10659.00 0.02 46.90 9.01 143.00 14.29 0.09 100.00
01:10:27 PM 0.00 0.00 8.00 13098.00 0.03 60.62 9.48 140.74 10.88 0.08 100.00
01:10:28 PM 0.00 0.00 7.00 10284.00 0.03 46.80 9.32 141.98 12.37 0.10 100.00
01:10:29 PM 0.00 0.00 8.00 8562.00 0.03 38.33 9.17 143.43 17.97 0.12 100.00
01:10:30 PM 0.00 0.00 7.00 10996.00 0.03 50.37 9.38 141.34 13.11 0.09 100.00
01:10:31 PM 0.00 0.00 6.00 12012.00 0.02 57.04 9.72 141.58 11.72 0.08 100.00
01:10:32 PM 0.00 0.00 7.00 12666.00 0.03 57.08 9.23 143.91 11.35 0.08 100.00
01:10:33 PM 0.00 0.00 7.00 7741.00 0.03 34.26 9.06 142.60 18.57 0.13 100.00
Time: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
01:10:34 PM 0.00 0.00 3.00 7262.00 0.01 30.16 8.50 144.23 19.86 0.14 100.00
01:10:35 PM 0.00 0.00 7.00 6518.00 0.03 27.27 8.57 144.97 20.80 0.15 100.00
01:10:36 PM 0.00 0.00 5.00 4828.00 0.02 19.70 8.36 143.86 30.95 0.21 100.00
01:10:37 PM 0.00 0.00 0.00 7858.00 0.00 35.13 9.16 144.19 18.65 0.13 100.00
01:10:38 PM 0.00 0.00 4.00 11041.00 0.02 48.77 9.05 143.72 13.04 0.09 100.00
01:10:39 PM 0.00 0.00 11.00 8409.00 0.04 35.82 8.72 144.18 16.87 0.12 100.00
01:10:40 PM 0.00 0.00 1440.00 1206.00 5.62 5.09 8.29 33.14 14.05 0.29 77.20
01:10:41 PM 0.00 0.00 2464.00 0.00 9.62 0.00 8.00 0.60 0.24 0.24 59.60
01:10:42 PM 0.00 0.00 2230.00 950.00 8.71 259.55 172.77 9.96 1.99 0.18 58.40
01:10:43 PM 0.00 0.00 10.00 3031.00 0.04 827.06 557.02 143.37 47.04 0.33 100.00
01:10:44 PM 0.00 0.00 7.00 1918.00 0.03 523.48 556.96 144.06 73.96 0.52 100.40
01:10:45 PM 0.00 0.00 8.00 2206.00 0.03 601.99 556.89 143.66 64.11 0.45 100.00
01:10:46 PM 0.00 0.00 16.00 2286.00 0.06 623.87 555.09 141.53 61.57 0.43 100.00
01:10:47 PM 0.00 1.00 826.00 533.00 3.23 145.17 223.63 35.18 31.19 0.64 86.80
01:10:48 PM 0.00 0.00 2596.00 0.00 10.14 0.00 8.00 0.64 0.25 0.25 64.00
01:10:49 PM 0.00 0.00 1725.00 570.00 6.74 154.55 143.93 3.89 1.69 0.33 75.60
This amounts to 30 seconds of a completely full device queue with a blocked dd process in the meantime. The process remains in the D (uninterruptible sleep) state for the duration of the block.
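If it helps with diagnosis, the kernel stack of the blocked writer can be sampled during a stall roughly like this (a sketch assuming a single dd process and a kernel with CONFIG_STACKTRACE; needs root):
# cat /proc/$(pidof dd)/stack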
This is not as cleanly reproducible as I would like it to be: the block/delay times vary significantly between runs. What is happening here, and how can I remedy the blockages?
Environment
SLES 11 SP3. Mount options: /dev/sdb1 on /mnt/btrfs type btrfs (rw,noatime,thread_pool=16). I tried mounting with nodatacow as well, but dd continues playing Sleeping Beauty with me.
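For completeness, the nodatacow test mount looked roughly like this (a sketch; device and mount point as above):
# mount -o rw,noatime,thread_pool=16,nodatacow /dev/sdb1 /mnt/btrfs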
$ uname -a
Linux host-1 3.0.101-0.47.71-default #1 SMP Thu Nov 12 12:22:22 UTC 2015 (b5b212e) x86_64 x86_64 x86_64 GNU/Linux
# cat /sys/bus/scsi/devices/*/queue_depth
256
256
256
The btrfs file system spans a single 72 TiB disk, which is a RAID-60 virtual disk on an LSI MPT RAID controller (MegaRAID 2208 w/ 1 GB FBWC). It is not my root file system and is not used for any system-internal activity (swap, logs, dumps, ...).
The platform hardware is a single-CPU E5-2620 (no NUMA) with 64 GB of RAM.
Edit
My real-world use case is a Windows client writing data to a dynamically expanding vhdx (virtual disk) file through Samba (3.6.3), but the behavior is essentially the same as with dd: the file grows rapidly and sequentially as blocks are written to the virtual disk, and it blocks in just the same way.
I have done some additional troubleshooting on this, concentrating on processes in the D state. There I noticed that a number of btrfs-related kernel threads are stuck in D along with Samba. As I cannot attach strace to kernel threads, I tried working with the wchan information output by ps, which is supposed to tell me which kernel function a process or thread is waiting in while it is in D:
while true; do ps -eo pid,stat,pcpu,nwchan,wchan=WIDE-WCHAN-COLUMN -o args | awk -v date="$(date +%r)" '$2 ~ /D/ { print date " " $4 " " $5 " " $6 }'; sleep 0.2; done
And this is what it looks like in action.
Here, the regular 30-second commit starts:
11:15:11 AM /usr/sbin/smbd lock_page
11:15:11 AM [flush-btrfs-1] -
11:15:12 AM /usr/sbin/smbd lock_page
11:15:12 AM [flush-btrfs-1] -
11:15:12 AM [btrfs-endio-wri] wait_on_page_lock
11:15:12 AM /usr/sbin/smbd lock_page
11:15:12 AM [flush-btrfs-1] -
11:15:12 AM [btrfs-submit-1] get_request_wait
11:15:12 AM /usr/sbin/smbd lock_page
11:15:12 AM /usr/sbin/smbd lock_page
11:15:12 AM [flush-btrfs-1] -
11:15:12 AM [btrfs-submit-1] get_request_wait
11:15:12 AM [btrfs-transacti] wait_on_page_lock
11:15:12 AM /usr/sbin/smbd lock_page
11:15:12 AM [flush-btrfs-1] lock_page
11:15:13 AM [btrfs-transacti] wait_on_page_lock
11:15:13 AM /usr/sbin/smbd lock_page
11:15:13 AM [flush-btrfs-1] lock_page
11:15:13 AM [btrfs-transacti] wait_on_page_lock
11:15:13 AM /usr/sbin/smbd lock_page
11:15:13 AM [flush-btrfs-1] lock_page
11:15:13 AM [btrfs-transacti] wait_on_page_lock
11:15:13 AM /usr/sbin/smbd -
11:15:13 AM [btrfs-transacti] wait_on_page_lock
11:15:13 AM /usr/sbin/smbd lock_page
11:15:13 AM [btrfs-transacti] wait_on_page_lock
11:15:13 AM /usr/sbin/smbd lock_page
At this point, the commit seems to have completed. An iostat -xdmt run confirms that nothing is written to disk after 11:15:13:
01/07/2016 11:15:11 AM
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 3437.00 0.00 13.43 0.00 8.00 0.50 0.15 0.15 50.40
01/07/2016 11:15:12 AM
sdb 0.00 0.00 1056.00 4475.00 4.12 1221.39 453.78 58.13 9.59 0.16 87.60
01/07/2016 11:15:13 AM
sdb 0.00 0.00 1477.00 1082.00 5.77 293.28 239.33 55.36 23.63 0.36 91.20
01/07/2016 11:15:14 AM
sdb 0.00 0.00 2846.00 0.00 11.12 0.00 8.00 0.89 0.32 0.24 68.80
01/07/2016 11:15:15 AM
sdb 0.00 0.00 1591.00 0.00 6.21 0.00 8.00 0.64 0.40 0.34 54.40
Then, at around 11:15:17, the writing process blocks entirely and does not respond again until around 11:15:29:
11:15:14 AM /usr/sbin/smbd -
11:15:14 AM /usr/sbin/smbd lock_page
11:15:14 AM /usr/sbin/smbd lock_page
11:15:14 AM /usr/sbin/smbd lock_page
11:15:15 AM /usr/sbin/smbd lock_page
11:15:15 AM [btrfs-transacti] wait_on_page_lock
11:15:15 AM /usr/sbin/smbd lock_page
11:15:16 AM /usr/sbin/smbd lock_page
11:15:16 AM /usr/sbin/smbd lock_page
11:15:17 AM /usr/sbin/smbd lock_page
11:15:17 AM [btrfs-transacti] get_request_wait
11:15:17 AM /usr/sbin/smbd get_request_wait
11:15:17 AM [btrfs-transacti] get_request_wait
11:15:17 AM /usr/sbin/smbd get_request_wait
11:15:17 AM [btrfs-transacti] get_request_wait
11:15:17 AM /usr/sbin/smbd lock_page
11:15:18 AM [btrfs-transacti] get_request_wait
11:15:18 AM /usr/sbin/smbd lock_page
11:15:18 AM /usr/sbin/smbd lock_page
11:15:18 AM [btrfs-transacti] get_request_wait
11:15:18 AM /usr/sbin/smbd lock_page
11:15:18 AM [btrfs-transacti] get_request_wait
11:15:18 AM /usr/sbin/smbd lock_page
11:15:18 AM [btrfs-transacti] get_request_wait
11:15:18 AM /usr/sbin/smbd lock_page
11:15:19 AM [btrfs-transacti] get_request_wait
11:15:19 AM /usr/sbin/smbd lock_page
11:15:19 AM [btrfs-transacti] get_request_wait
11:15:19 AM /usr/sbin/smbd lock_page
11:15:19 AM [btrfs-transacti] get_request_wait
11:15:19 AM /usr/sbin/smbd lock_page
11:15:19 AM [btrfs-transacti] get_request_wait
11:15:19 AM /usr/sbin/smbd lock_page
11:15:19 AM [btrfs-transacti] get_request_wait
11:15:19 AM /usr/sbin/smbd lock_page
11:15:20 AM /usr/sbin/smbd lock_page
11:15:20 AM [btrfs-transacti] get_request_wait
11:15:20 AM /usr/sbin/smbd lock_page
11:15:20 AM [btrfs-transacti] get_request_wait
11:15:20 AM /usr/sbin/smbd lock_page
11:15:20 AM [btrfs-transacti] get_request_wait
11:15:20 AM /usr/sbin/smbd lock_page
11:15:20 AM [btrfs-transacti] get_request_wait
11:15:20 AM /usr/sbin/smbd lock_page
11:15:21 AM [btrfs-transacti] get_request_wait
11:15:21 AM /usr/sbin/smbd lock_page
11:15:21 AM [btrfs-transacti] get_request_wait
11:15:21 AM /usr/sbin/smbd lock_page
11:15:21 AM [btrfs-transacti] get_request_wait
11:15:21 AM /usr/sbin/smbd lock_page
11:15:21 AM [btrfs-transacti] get_request_wait
11:15:21 AM /usr/sbin/smbd lock_page
11:15:22 AM [btrfs-transacti] get_request_wait
11:15:22 AM /usr/sbin/smbd lock_page
11:15:22 AM [btrfs-transacti] get_request_wait
11:15:22 AM /usr/sbin/smbd lock_page
11:15:22 AM [btrfs-transacti] get_request_wait
11:15:22 AM /usr/sbin/smbd lock_page
11:15:22 AM [btrfs-transacti] get_request_wait
11:15:22 AM /usr/sbin/smbd lock_page
11:15:22 AM [btrfs-transacti] get_request_wait
11:15:22 AM /usr/sbin/smbd lock_page
11:15:23 AM [btrfs-transacti] get_request_wait
11:15:23 AM /usr/sbin/smbd lock_page
11:15:23 AM [btrfs-transacti] get_request_wait
11:15:23 AM /usr/sbin/smbd lock_page
11:15:23 AM [btrfs-transacti] get_request_wait
11:15:23 AM /usr/sbin/smbd lock_page
11:15:23 AM [btrfs-transacti] get_request_wait
11:15:23 AM /usr/sbin/smbd lock_page
11:15:23 AM [btrfs-transacti] get_request_wait
11:15:23 AM /usr/sbin/smbd lock_page
11:15:24 AM [btrfs-transacti] get_request_wait
11:15:24 AM /usr/sbin/smbd lock_page
11:15:24 AM [btrfs-transacti] get_request_wait
11:15:24 AM /usr/sbin/smbd lock_page
11:15:24 AM /usr/sbin/smbd lock_page
11:15:24 AM [btrfs-transacti] get_request_wait
11:15:24 AM /usr/sbin/smbd get_request_wait
11:15:24 AM [btrfs-transacti] get_request_wait
11:15:24 AM /usr/sbin/smbd lock_page
11:15:25 AM [btrfs-transacti] get_request_wait
11:15:25 AM /usr/sbin/smbd lock_page
11:15:25 AM [btrfs-transacti] get_request_wait
11:15:25 AM /usr/sbin/smbd lock_page
11:15:25 AM [btrfs-transacti] get_request_wait
11:15:25 AM /usr/sbin/smbd lock_page
11:15:25 AM [btrfs-transacti] get_request_wait
11:15:25 AM /usr/sbin/smbd lock_page
11:15:26 AM [btrfs-transacti] get_request_wait
11:15:26 AM /usr/sbin/smbd lock_page
11:15:26 AM [btrfs-transacti] get_request_wait
11:15:26 AM /usr/sbin/smbd lock_page
11:15:26 AM [btrfs-transacti] get_request_wait
11:15:26 AM /usr/sbin/smbd lock_page
11:15:26 AM [btrfs-transacti] get_request_wait
11:15:26 AM /usr/sbin/smbd lock_page
11:15:26 AM [btrfs-transacti] get_request_wait
11:15:26 AM /usr/sbin/smbd lock_page
11:15:27 AM [btrfs-transacti] get_request_wait
11:15:27 AM /usr/sbin/smbd lock_page
11:15:27 AM [btrfs-transacti] get_request_wait
11:15:27 AM /usr/sbin/smbd lock_page
11:15:27 AM [btrfs-transacti] get_request_wait
11:15:27 AM /usr/sbin/smbd lock_page
11:15:27 AM [btrfs-transacti] get_request_wait
11:15:27 AM /usr/sbin/smbd lock_page
11:15:27 AM [btrfs-transacti] get_request_wait
11:15:27 AM /usr/sbin/smbd lock_page
11:15:28 AM [btrfs-transacti] get_request_wait
11:15:28 AM /usr/sbin/smbd lock_page
11:15:28 AM [btrfs-transacti] get_request_wait
11:15:28 AM /usr/sbin/smbd lock_page
11:15:28 AM [btrfs-transacti] get_request_wait
11:15:28 AM /usr/sbin/smbd lock_page
11:15:28 AM [btrfs-transacti] get_request_wait
11:15:28 AM /usr/sbin/smbd lock_page
11:15:29 AM [btrfs-transacti] get_request_wait
11:15:29 AM /usr/sbin/smbd lock_page
11:15:29 AM [btrfs-transacti] get_request_wait
11:15:29 AM /usr/sbin/smbd lock_page
11:15:29 AM [btrfs-transacti] get_request_wait
11:15:29 AM /usr/sbin/smbd lock_page
11:15:29 AM [btrfs-transacti] get_request_wait
11:15:29 AM /usr/sbin/smbd lock_page
11:15:29 AM [btrfs-transacti] get_request_wait
11:15:29 AM /usr/sbin/smbd lock_page
11:15:30 AM /usr/sbin/smbd lock_page
11:15:30 AM /usr/sbin/smbd lock_page
11:15:30 AM /usr/sbin/smbd lock_page
11:15:30 AM /usr/sbin/smbd lock_page
This is reflected in the iostat data as write load with a high avgqu-sz:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
11:15:16 AM
sdb 0.00 0.00 3865.00 0.00 15.10 0.00 8.00 0.46 0.12 0.11 43.20
11:15:17 AM
sdb 0.00 0.00 706.00 9021.00 2.76 116.05 25.01 62.22 6.13 0.10 95.60
11:15:18 AM
sdb 0.00 0.00 13.00 7344.00 0.05 32.70 9.12 141.22 19.35 0.14 100.00
11:15:19 AM
sdb 0.00 0.00 12.00 6219.00 0.05 25.79 8.49 142.56 22.29 0.16 100.00
11:15:20 AM
sdb 0.00 0.00 17.00 4987.00 0.07 20.30 8.34 144.62 29.01 0.20 100.00
11:15:21 AM
sdb 0.00 0.00 2.00 5965.00 0.01 24.98 8.58 143.66 24.63 0.17 100.00
11:15:22 AM
sdb 0.00 0.00 0.00 8145.00 0.00 39.87 10.03 143.13 17.25 0.12 100.00
11:15:23 AM
sdb 0.00 0.00 3.00 4371.00 0.01 18.66 8.74 143.18 32.77 0.23 100.00
11:15:24 AM
sdb 0.00 0.00 2.00 4142.00 0.01 17.32 8.56 144.14 34.99 0.24 100.00
11:15:25 AM
sdb 0.00 0.00 4.00 5014.00 0.02 20.66 8.44 142.62 28.58 0.20 100.00
11:15:26 AM
sdb 0.00 0.00 2.00 4321.00 0.01 17.68 8.38 146.49 33.98 0.23 100.00
11:15:27 AM
sdb 0.00 0.00 0.00 5741.00 0.00 25.59 9.13 145.61 25.26 0.17 100.00
11:15:28 AM
sdb 0.00 0.00 6.00 6775.00 0.02 28.79 8.70 145.66 21.69 0.15 100.00
11:15:29 AM
sdb 0.00 0.00 8.00 9546.00 0.03 43.75 9.38 141.55 14.60 0.10 100.00
11:15:30 AM
sdb 0.00 0.00 544.00 5247.00 2.12 24.03 9.25 73.91 13.26 0.16 92.00
11:15:31 AM
sdb 0.00 0.00 2294.00 0.00 8.96 0.00 8.00 0.62 0.27 0.27 62.40
Looking up what get_request_wait is supposed to mean, I stumbled upon "Understanding the Linux Kernel", which also covers the I/O scheduler and states:
The number of request descriptors may become, under very heavy loads and high disk activity, a bottleneck. A dearth of free descriptors may force processes to wait until an ongoing data transfer terminates. Thus, a wait queue is used to queue processes waiting for a free request element. The get_request_wait() tries to get a free request descriptor and puts the current process to sleep in the wait queue, if none is found.
(emphasis mine)
I tried playing with the value in /sys/block/sdb/queue/nr_requests, increasing it to 4096 (from 128). This gave me even higher queue sizes (avgqu-sz around 4k in iostat) during the stalls, but the overall picture remained the same: many small (8 KB) write operations at roughly the same rate (4k-12k IOPS) for a prolonged period of time, blocking further writes to the file.
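For reference, the change was made and verified roughly like this (a sketch; sdb is the btrfs device):
# cat /sys/block/sdb/queue/nr_requests
128
# echo 4096 > /sys/block/sdb/queue/nr_requests
# cat /sys/block/sdb/queue/nr_requests
4096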