19

Almost everywhere, I'm getting failures in my logs complaining about "No space left on device".

Gitlab logs:

==> /var/log/gitlab/nginx/current <==
2016-11-29_20:26:51.61394 2016/11/29 20:26:51 [emerg] 4871#0: open() "/var/opt/gitlab/nginx/nginx.pid" failed (28: No space left on device)

Dovecot email logs:

Nov 29 20:28:32 aws-management dovecot: imap(email@www.sitename.com): Error: open(/home/vmail/emailuser/Maildir/dovecot-uidlist.lock) failed: No space left on device

Output of df -Th:

Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/xvda1     ext4      7.8G  3.9G  3.8G  51% /
devtmpfs       devtmpfs  1.9G   28K  1.9G   1% /dev
tmpfs          tmpfs     1.9G   12K  1.9G   1% /dev/shm
/dev/xvdh      btrfs      20G   13G  7.9G  61% /mnt/durable
/dev/xvdh      btrfs      20G   13G  7.9G  61% /home
/dev/xvdh      btrfs      20G   13G  7.9G  61% /opt/gitlab
/dev/xvdh      btrfs      20G   13G  7.9G  61% /var/opt/gitlab
/dev/xvdh      btrfs      20G   13G  7.9G  61% /var/cache/salt

It looks like there is also plenty of inode space. Output of df -i:

Filesystem     Inodes  IUsed  IFree IUse% Mounted on
/dev/xvda1     524288 105031 419257   21% /
devtmpfs       475308    439 474869    1% /dev
tmpfs          480258      4 480254    1% /dev/shm
/dev/xvdh           0      0      0     - /mnt/durable
/dev/xvdh           0      0      0     - /home
/dev/xvdh           0      0      0     - /opt/gitlab
/dev/xvdh           0      0      0     - /var/opt/gitlab
/dev/xvdh           0      0      0     - /var/cache/salt

Output of btrfs fi show:

Label: none  uuid: 6546c241-e57e-4a3f-bf43-fa933a3b29f9
        Total devices 4 FS bytes used 11.86GiB
        devid    1 size 10.00GiB used 10.00GiB path /dev/xvdh
        devid    2 size 10.00GiB used 9.98GiB path /dev/xvdi
        devid    3 size 10.00GiB used 9.98GiB path /dev/xvdj
        devid    4 size 10.00GiB used 9.98GiB path /dev/xvdk

Output of btrfs fi df /mnt/durable:

Data, RAID10: total=17.95GiB, used=10.12GiB
Data, single: total=8.00MiB, used=0.00
System, RAID10: total=16.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID10: total=2.00GiB, used=1.74GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=272.00MiB, used=8.39MiB
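
For what it's worth, if the btrfs-progs on this AMI is new enough, the following should summarize allocated vs. unallocated space in one place (I haven't confirmed that the version shipped here supports it):

btrfs filesystem usage /mnt/durable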

What could be the cause of this? I'm using a base Amazon Linux AMI on EC2, kernel version 4.4.5-15.26.amzn1.x86_64.

Update

Running the command suggested below, btrfs fi balance start -dusage=5 /mnt/durable, gave me the following error:

ERROR: error during balancing '/mnt/durable' - No space left on device
There may be more info in syslog - try dmesg | tail

After manually deleting a bunch of larger files totaling ~1GB, I rebooted the machine and tried again (making sure I was using sudo), and the command executed. I then rebooted the machine once more for good measure, and that seems to have solved the problem.

Austin

4 Answers

20

Welcome to the world of BTRFS. It has some tantalizing features but also some infuriating issues.

First off, some info on your setup: it looks like you have four drives in a BTRFS "raid 10" volume (so all data is stored twice, on different disks). This BTRFS volume is then carved up into subvolumes on different mount points. The subvolumes share a pool of disk space but have separate inode numbers and can be mounted in different places.
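
If you want to double-check that layout, something like the following (run against any one of those mount points; they all belong to the same filesystem) will list the subvolumes and show which mounts are backed by the same device:

btrfs subvolume list /mnt/durable
grep btrfs /proc/mounts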

BTRFS allocates space in "chunks"; each chunk is dedicated to a specific class, either data or metadata. What can happen (and looks like what has happened in your case) is that all of the free space gets allocated to data chunks, leaving no room for metadata.

It also seems that (for reasons I don't fully understand) BTRFS "runs out" of metadata space before the indicator of the proportion of metadata space used reaches 100%.

This appears to be what has happened in your case: there is lots of free data space, but no free space that has not already been allocated to chunks, and insufficient free space within the existing metadata chunks.
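
As a rough back-of-the-envelope check using the numbers in your question (RAID10 keeps two copies, so each usable GiB reported by btrfs fi df costs roughly two raw GiB):

Data, RAID10:      17.95 GiB usable  ->  ~35.9 GiB raw
Metadata, RAID10:   2.00 GiB usable  ->   ~4.0 GiB raw
System + single chunks:                   <0.1 GiB raw
                                         ----------------
                                         ~39.9 GiB raw allocated

That matches the device lines in your btrfs fi show output (10.00 + 9.98 + 9.98 + 9.98 ≈ 39.9 GiB): every chunk's worth of raw disk has already been handed out, even though the data chunks themselves are only a little over half full.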

The fix is to run a "rebalance". This will move data around so that some chunks can be returned to the "global" free pool, where they can be reallocated as metadata chunks.

btrfs fi balance start -dusage=5 /mnt/durable

The number after -dusage sets how aggressive the rebalance is, that is, how close to empty a block has to be in order to get rewritten. If the balance says it rewrote 0 blocks, try again with a higher value of -dusage.
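
If you end up having to walk that value up, a quick loop along these lines (just a sketch; adjust the mount point and the thresholds to whatever fits) saves some retyping:

for usage in 5 10 20 40; do
    btrfs fi balance start -dusage=$usage /mnt/durable
done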

If the balance fails, then I would try rebooting and/or freeing up some space by removing files.

Peter Green
4

Since you're running btrfs with a RAID setup, try running a balance operation.

btrfs balance start /var/opt/gitlab

If this gives an error about not having enough space, try again with this syntax:

btrfs balance start -musage=0 -dusage=0 -susage=0 /var/opt/gitlab 

Repeat this operation for each btrfs filesystem where you are seeing errors about space. If your space problem is due to the metadata not being distributed across the mirrored disks, this might free up some space for you.
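
For what it's worth, a quick way to see which of those mount points actually share one underlying filesystem (in the df output above they are all on /dev/xvdh, so a single balance should cover them) is:

findmnt -t btrfs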

virtex
  • I did get an error about space. When trying the other syntax it shows me what looks like a warning: `Refusing to explicitly operate on system chunks. Pass --force if you really want to do that.` Is that OK to do? – Austin Nov 30 '16 at 19:06
  • try it without the `-susage=0` option. – virtex Nov 30 '16 at 20:20
2

On my system, I added the following job in cron.monthly.

The clear_cache remount is there because of some corruption issues btrfs was having with the free space cache. (I think they finally found the issue, but it is annoying enough that I'm willing to pay the cost of rebuilding the cache once a month.)

I ramp up the usage options so that each pass gradually frees space for the larger and larger balances that follow.

#!/bin/sh

# Rebalance every mounted btrfs filesystem, clearing the free space cache first.
for mountpoint in $(mount -t btrfs | awk '{print $3}' | sort -u)
do
    echo --------------------------
    echo Balancing $mountpoint :
    echo --------------------------
    echo remount with clear_cache...
    mount -o remount,clear_cache "$mountpoint"
    echo Before:
    /usr/sbin/btrfs fi show "$mountpoint"
    /usr/sbin/btrfs fi df "$mountpoint"
    # Work up from nearly-empty chunks to fuller ones, so each pass frees
    # room for the next, larger one.
    for size in 0 1 5 10 20 30 40 50 60 70 80 90
    do
        time /usr/sbin/btrfs balance start -v -musage=$size "$mountpoint" 2>&1
        time /usr/sbin/btrfs balance start -v -dusage=$size "$mountpoint" 2>&1
    done
    echo After:
    /usr/sbin/btrfs fi show "$mountpoint"
    /usr/sbin/btrfs fi df "$mountpoint"
done

If you get to the point where you can't rebalance because you have insufficient space, the recommendation is to temporarily add another block device (or a loopback device backed by another disk) to your volume for the duration of the rebalance, and then remove it.
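
A rough sketch of that workaround, assuming you have a few GiB of slack on some other filesystem (the file path and size here are only placeholders):

# create a temporary backing file on a different filesystem and attach it
truncate -s 4G /some/other/fs/btrfs-slack.img
LOOPDEV=$(losetup -f --show /some/other/fs/btrfs-slack.img)
btrfs device add "$LOOPDEV" /mnt/durable

# rebalance now that there is unallocated space again
btrfs balance start -dusage=5 /mnt/durable

# migrate everything back off the loop device and detach it
btrfs device delete "$LOOPDEV" /mnt/durable
losetup -d "$LOOPDEV"
rm /some/other/fs/btrfs-slack.img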

rrauenza
2

This isn't so much an issue with btrfs as it is something that has been done to this system. This looks like the result of an incomplete rebalance from a 'single' allocation policy to a 'raid 10' allocation policy, as evidenced by the large number of single-allocated blocks. It probably started out as single, and then a conversion was interrupted. A pool with such inconsistent allocation is bound to have... well, allocation issues.

Consider that you have 61% of your pool consumed. Your allocation policy is RAID10, so that should result in a maximum of 50% pool consumption before it is effectively full, since everything is replicated twice. This is why your conversion from single to RAID 10 has failed (and continues to fail). I can only guess, but data was probably still being written to the pool in the middle of the rebalance. There is no space left on your device to rebalance to RAID 10 with the disks you have. The only reason you got to 61% is that your disks are inconsistently allocated: some linearly, with single allocation, and most in RAID 10.

You could rebalance to a single allocation policy if you wanted to gain space without changing much of anything. You could also add more disks or increase the size of the disks. Or you could, as you have done in this case, just delete a bunch of files so that your pool can actually balance to RAID 10 (since it would then be less than 50% consumed overall). Do make sure you rebalance after deleting the files, or you'll still be stuck with this janky allocation.

Specifically, enforce RAID 10 when rebalancing after deleting those files, to make sure you get rid of the single-allocated blocks, like so:

btrfs fi balance start -dconvert=raid10 -mconvert=raid10 /home
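
Afterwards it's worth re-checking the chunk layout; once the conversion completes, the single-profile data and metadata lines should no longer appear in the output:

btrfs fi df /home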

Spooler