
I have a 10TB BTRFS filesystem made of 7 whole-disk devices (no partitions) in a JBOD server; each device is a physical drive exposed to the OS as a single-drive RAID0 volume*. The filesystem spanning the 7 drives was created with RAID1 data, metadata and system, which means only about 5TB of space is usable.
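
For context, a multi-device layout with RAID1 data, metadata and system is typically created roughly like this; this is a minimal sketch, not necessarily how this particular filesystem was built (the footnote below describes drives being added over time):

# sketch only: one-shot creation of a 7-device BTRFS filesystem with duplicated data/metadata/system
sudo mkfs.btrfs -L data -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh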

The setup suffered some power outages and the filesystem is now corrupted.

I started a btrfs scrub that took 10 hours; it corrected some errors but still reports unrecoverable errors. Here's the log:

scrub status:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:3|data_extents_scrubbed:43337833|tree_extents_scrubbed:274036|data_bytes_scrubbed:2831212044288|tree_bytes_scrubbed:4489805824|read_errors:0|csum_errors:0|verify_errors:0|no_csum:45248|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2908834758656|t_start:1548346756|t_resumed:0|duration:33370|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:4|data_extents_scrubbed:6079208|tree_extents_scrubbed:57260|data_bytes_scrubbed:397180661760|tree_bytes_scrubbed:938147840|read_errors:0|csum_errors:0|verify_errors:0|no_csum:5248|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:409096683520|t_start:1548346756|t_resumed:0|duration:6044|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:5|data_extents_scrubbed:13713623|tree_extents_scrubbed:63427|data_bytes_scrubbed:895829155840|tree_bytes_scrubbed:1039187968|read_errors:67549319|csum_errors:34597|verify_errors:45|no_csum:40128|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:67546631|corrected_errors:37330|last_physical:909460373504|t_start:1548346756|t_resumed:0|duration:20996|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:6|data_extents_scrubbed:44399586|tree_extents_scrubbed:267573|data_bytes_scrubbed:2890078298112|tree_bytes_scrubbed:4383916032|read_errors:0|csum_errors:0|verify_errors:0|no_csum:264000|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2908834758656|t_start:1548346756|t_resumed:0|duration:35430|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:7|data_extents_scrubbed:13852777|tree_extents_scrubbed:0|data_bytes_scrubbed:898808254464|tree_bytes_scrubbed:0|read_errors:0|csum_errors:0|verify_errors:0|no_csum:133376|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:909460373504|t_start:1548346756|t_resumed:0|duration:20638|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:8|data_extents_scrubbed:13806820|tree_extents_scrubbed:0|data_bytes_scrubbed:896648761344|tree_bytes_scrubbed:0|read_errors:0|csum_errors:0|verify_errors:0|no_csum:63808|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:909460373504|t_start:1548346756|t_resumed:0|duration:20443|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:9|data_extents_scrubbed:5443823|tree_extents_scrubbed:0|data_bytes_scrubbed:356618694656|tree_bytes_scrubbed:0|read_errors:0|csum_errors:0|verify_errors:0|no_csum:0|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:377958170624|t_start:1548346756|t_resumed:0|duration:3199|canceled:0|finished:1
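
The exact invocations aren't shown above; a scrub on a mounted multi-device filesystem is typically started and monitored roughly like this:

sudo btrfs scrub start /data        # runs in the background against the mounted filesystem
sudo btrfs scrub status -d /data    # -d prints separate statistics for each device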

I then unmounted the volume and ran btrfs check --repair, with this output:

Checking filesystem on /dev/sdb
UUID: 1ea7ff96-0c60-46c3-869c-ae398cd106a8
checking extents [o]
cache and super generation don't match, space cache will be invalidated
checking fs roots [o]
checking csums
checking root refs
found 4588612874240 bytes used err is 0
total csum bytes: 4474665852
total tree bytes: 5423104000
total fs tree bytes: 734445568
total extent tree bytes: 71221248
btree space waste bytes: 207577944
file data blocks allocated: 4583189770240
 referenced 4583185391616
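
The commands for that step aren't shown above; presumably they looked roughly like this:

sudo umount /data                     # the filesystem must be unmounted for an offline check
sudo btrfs check --repair /dev/sdb    # run against one member device, here /dev/sdb as in the output above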

Now I can't mount the volume with mount -a; the output is:

mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Inspecting dmesg, I found these messages output during the scrub:

[37825.838303] BTRFS error (device sde): bdev /dev/sdf errs: wr 67699124, rd 67694614, flush 0, corrupt 34597, gen 45
[37826.202827] sd 1:1:0:4: rejecting I/O to offline device

The mount errors that appear later in dmesg are as follows:

[pciavald@Host-005 ~]$ sudo mount -a
[63078.778765] BTRFS info (device sde): disk space caching is enabled
[63078.778771] BTRFS info (device sde): has skinny extents
[63078.779882] BTRFS error (device sde): failed to read chunk tree: -5
[63078.790696] BTRFS: open_ctree failed

[pciavald@Host-005 ~]$ sudo mount -o recovery,ro /dev/sdb /data
[75788.205006] BTRFS warning (device sde): 'recovery' is deprecated, use 'usebackuproot' instead
[75788.205012] BTRFS info (device sde): trying to use backup root at mount time
[75788.205016] BTRFS info (device sde): disk space caching is enabled
[75788.205018] BTRFS info (device sde): has skinny extents
[75788.206382] BTRFS error (device sde): failed to read chunk tree: -5
[75788.215661] BTRFS: open_ctree failed

[pciavald@Host-005 ~]$ sudo mount -o usebackuproot,ro /dev/sdb /data
[76171.713546] BTRFS info (device sde): trying to use backup root at mount time
[76171.713552] BTRFS info (device sde): disk space caching is enabled
[76171.713556] BTRFS info (device sde): has skinny extents
[76171.714829] BTRFS error (device sde): failed to read chunk tree: -5
[76171.725735] BTRFS: open_ctree failed
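
With one member device offline, a usual next step (not attempted above) would be to rescan the member devices and try a degraded, read-only mount, roughly:

sudo btrfs device scan                      # re-register all BTRFS member devices with the kernel
sudo mount -o degraded,ro /dev/sdb /data    # tolerate the missing/failed member, read-only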

From the scrub log, all unrecoverable errors are located on a single device, devid 5 (reported as 1ea7ff96-0c60-46c3-869c-ae398cd106a8:5), and the dmesg messages suggest that device is /dev/sdf.

*: I know that running BTRFS on volumes managed by a hardware RAID controller rather than directly on the physical drives is not ideal, but I had no choice. Each drive inserted in the array is configured as a single-drive RAID0 volume, which makes it visible to the OS. These logical drives were then used whole (no partition table) as BTRFS devices and added to the filesystem, with data and metadata duplicated.
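
A sketch of how each new logical drive would have been added while keeping everything duplicated, assuming the filesystem is mounted at /data (/dev/sdX is illustrative):

sudo btrfs device add /dev/sdX /data                              # add the new whole-disk device
sudo btrfs balance start -dconvert=raid1 -mconvert=raid1 /data    # keep data and metadata in RAID1 across devices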

EDIT: I went down to the server to reboot it into a newer kernel and noticed that the failing drive, /dev/sdf, had its fail-state LED on. I shut down the server, restarted the JBOD and the server, and the LED turned green again. The volume now mounts correctly and I relaunched the scrub. After 6 minutes the status already showed errors, with no indication yet of whether they can be corrected:

scrub status for 1ea7ff96-0c60-46c3-869c-ae398cd106a8
        scrub started at Fri Jan 25 11:53:28 2019, running for 00:06:31
        total bytes scrubbed: 243.83GiB with 3 errors
        error details: super=3
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

When the scrub finished, after 8 hours this time, the output was as follows:

scrub status for 1ea7ff96-0c60-46c3-869c-ae398cd106a8
        scrub started at Fri Jan 25 11:53:28 2019 and finished after 07:59:20
        total bytes scrubbed: 8.35TiB with 67549322 errors
        error details: read=67549306 super=3 csum=13
        corrected errors: 2701, uncorrectable errors: 67546618, unverified errors: 0

The new log for that scrub is as follows:

1ea7ff96-0c60-46c3-869c-ae398cd106a8:3|data_extents_scrubbed:43337833|tree_extents_scrubbed:273855|data_bytes_scrubbed:2831212044288|tree_bytes_scrubbed:4486840320|read_errors:0|csum_errors:0|verify_errors:0|no_csum:45248|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2908834758656|t_start:1548413608|t_resumed:0|duration:26986|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:4|data_extents_scrubbed:6079208|tree_extents_scrubbed:57127|data_bytes_scrubbed:397180661760|tree_bytes_scrubbed:935968768|read_errors:0|csum_errors:0|verify_errors:0|no_csum:5248|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:409096683520|t_start:1548413608|t_resumed:0|duration:6031|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:5|data_extents_scrubbed:13713623|tree_extents_scrubbed:63206|data_bytes_scrubbed:895829155840|tree_bytes_scrubbed:1035567104|read_errors:67549306|csum_errors:13|verify_errors:0|no_csum:40128|csum_discards:0|super_errors:3|malloc_errors:0|uncorrectable_errors:67546618|corrected_errors:2701|last_physical:909460373504|t_start:1548413608|t_resumed:0|duration:14690|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:6|data_extents_scrubbed:44399652|tree_extents_scrubbed:267794|data_bytes_scrubbed:2890081705984|tree_bytes_scrubbed:4387536896|read_errors:0|csum_errors:0|verify_errors:0|no_csum:264832|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:2908834758656|t_start:1548413608|t_resumed:0|duration:28760|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:7|data_extents_scrubbed:13852771|tree_extents_scrubbed:0|data_bytes_scrubbed:898807992320|tree_bytes_scrubbed:0|read_errors:0|csum_errors:0|verify_errors:0|no_csum:133312|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:909460373504|t_start:1548413608|t_resumed:0|duration:14372|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:8|data_extents_scrubbed:13806827|tree_extents_scrubbed:0|data_bytes_scrubbed:896649023488|tree_bytes_scrubbed:0|read_errors:0|csum_errors:0|verify_errors:0|no_csum:63872|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:909460373504|t_start:1548413608|t_resumed:0|duration:14059|canceled:0|finished:1
1ea7ff96-0c60-46c3-869c-ae398cd106a8:9|data_extents_scrubbed:5443823|tree_extents_scrubbed:3|data_bytes_scrubbed:356618694656|tree_bytes_scrubbed:49152|read_errors:0|csum_errors:0|verify_errors:0|no_csum:0|csum_discards:0|super_errors:0|malloc_errors:0|uncorrectable_errors:0|corrected_errors:0|last_physical:377991725056|t_start:1548413608|t_resumed:0|duration:3275|canceled:0|finished:1

The same device has all the uncorrectable errors again, so I listed the filesystem's devices; devid 5 is missing from the list:

[pciavald@Host-001 ~]$ sudo btrfs fi show /data
Label: 'data'  uuid: 1ea7ff96-0c60-46c3-869c-ae398cd106a8
        Total devices 7 FS bytes used 4.17TiB
        devid    3 size 2.73TiB used 2.65TiB path /dev/sdd
        devid    4 size 465.73GiB used 381.00GiB path /dev/sde
        devid    6 size 2.73TiB used 2.65TiB path /dev/sdb
        devid    7 size 931.48GiB used 847.00GiB path /dev/sdc
        devid    8 size 931.48GiB used 847.00GiB path /dev/sdg
        devid    9 size 931.48GiB used 352.03GiB path /dev/sdh
        *** Some devices missing

All devices are listed except devid 5 (/dev/sdf), so I assume that is the broken drive. Because the data is duplicated, I should be able to delete this device and rebalance, so I tried:

[pciavald@Host-001 ~]$ sudo btrfs device delete /dev/sdf /data
ERROR: error removing device '/dev/sdf': No such device or address

How can I properly delete that device?
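
For reference, the commonly suggested form for dropping a member that the kernel reports as missing (assuming the filesystem can be mounted, possibly degraded, and has enough free space to re-replicate) is:

sudo mount -o degraded /dev/sdb /data     # if a normal mount is refused because of the missing member
sudo btrfs device delete missing /data    # 'missing' addresses the absent device instead of a /dev path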

EDIT 2: I went to the freenode #btrfs IRC channel for help and did the following investigation. In the usage output we can see the overall layout, with each piece of data stored on two different drives:

[pciavald@Host-001 ~]$ sudo btrfs fi usage /data
Overall:
    Device size:                   9.55TiB
    Device allocated:              8.49TiB
    Device unallocated:            1.06TiB
    Device missing:              931.48GiB
    Used:                          8.35TiB
    Free (estimated):            615.37GiB      (min: 615.37GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:4.24TiB, Used:4.17TiB
   /dev/sdb        2.64TiB
   /dev/sdc      847.00GiB
   /dev/sdd        2.64TiB
   /dev/sde      380.00GiB
   /dev/sdf      846.00GiB
   /dev/sdg      847.00GiB
   /dev/sdh      352.00GiB

Metadata,RAID1: Size:6.00GiB, Used:5.05GiB
   /dev/sdb        5.00GiB
   /dev/sdd        5.00GiB
   /dev/sde        1.00GiB
   /dev/sdf        1.00GiB

System,RAID1: Size:64.00MiB, Used:624.00KiB
   /dev/sdb       64.00MiB
   /dev/sdd       32.00MiB
   /dev/sdh       32.00MiB

Unallocated:
   /dev/sdb       85.43GiB
   /dev/sdc       84.48GiB
   /dev/sdd       85.46GiB
   /dev/sde       84.73GiB
   /dev/sdf       84.48GiB
   /dev/sdg       84.48GiB
   /dev/sdh      579.45GiB

In btrfs dev stats /data we can see that all the errors are located on /dev/sdf, indicating that the scrub's unrecoverable errors were not due to corruption in the mirror copies of the data, but to the OS being unable to read from or write to the defective drive at all:

[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0
[/dev/sde].write_io_errs    0
[/dev/sde].read_io_errs     0
[/dev/sde].flush_io_errs    0
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0
[/dev/sdf].write_io_errs    135274911
[/dev/sdf].read_io_errs     135262641
[/dev/sdf].flush_io_errs    0
[/dev/sdf].corruption_errs  34610
[/dev/sdf].generation_errs  48
[/dev/sdb].write_io_errs    0
[/dev/sdb].read_io_errs     0
[/dev/sdb].flush_io_errs    0
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdg].write_io_errs    0
[/dev/sdg].read_io_errs     0
[/dev/sdg].flush_io_errs    0
[/dev/sdg].corruption_errs  0
[/dev/sdg].generation_errs  0
[/dev/sdh].write_io_errs    0
[/dev/sdh].read_io_errs     0
[/dev/sdh].flush_io_errs    0
[/dev/sdh].corruption_errs  0
[/dev/sdh].generation_errs  0

I've ordered a new 1TB drive to replace /dev/sdf and will write an answer to this question once I've managed to replace it.

1 Answer


After inserting the new 1TB drive into the array, I configured it as a single-drive RAID0 volume in the array configuration utility to make it visible to the OS. Then, without creating any partition table or partitions on the drive, I issued the following command:

sudo btrfs replace start -r -f 5 /dev/sdi /data

Let's break this down: we ask btrfs to start replacing devid 5 (we use the devid instead of /dev/sdf because the device may be in a missing state) with the newly inserted drive /dev/sdi. -r tells it to read from the source device only if no other zero-defect mirror exists, -f forces overwriting the target disk, and /data is the mountpoint.
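
Progress can be checked at any time while the replace runs; the invocation isn't shown above, but the standard one is:

sudo btrfs replace status /data    # shows percentage done plus write and uncorrectable read error counters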

Two and a half hours later the replace finished, with this status:

Started on 27.Jan 21:57:31, finished on 28.Jan 00:19:23, 0 write errs, 0 uncorr. read errs

Note that there was one more power outage during the replace; it lasted exactly 10 seconds longer than the new UPS could handle, so the server went down mid-operation. After restarting the server, the replace resumed without my issuing any command: it was already running when I checked the status at boot.

I then scrubbed the volume again; here's the output:

scrub status for 1ea7ff96-0c60-46c3-869c-ae398cd106a8
        scrub started at Mon Jan 28 00:24:15 2019 and finished after 06:45:57
        total bytes scrubbed: 8.35TiB with 212759 errors
        error details: csum=212759
        corrected errors: 212759, uncorrectable errors: 0, unverified errors: 0

Everything is now corrected. I relaunched the scrub to make sure no further errors turn up:

scrub status for 1ea7ff96-0c60-46c3-869c-ae398cd106a8
        scrub started at Mon Jan 28 10:19:24 2019 and finished after 06:33:05
        total bytes scrubbed: 8.35TiB with 0 errors
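
One optional follow-up, not shown above: once everything is clean, the per-device error counters can be reset so that any future errors stand out immediately:

sudo btrfs device stats -z /data    # -z prints the current counters and resets them to zero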