
I would like to programmatically test Windows ReFS Health Check and Recovery features.

Note: ReFS on its own only detects bit rot (no self-healing). For ReFS to both detect and auto-heal, one must also use Storage Spaces. So, I have prepared a mirrored Storage Spaces pool S:\ with a 2-way mirror setup.

ReFS integrity streams have been enabled with:

PS C:\> Set-FileIntegrity -FileName 'S:\' -Enable $True

as per instructions found here.

How can I programmatically simulate file corruption to test ReFS Health Check and Recovery features?

I can't find an easy way to introduce bit rot. Every method I tried made only changes that ReFS accepts as legitimate.

A PowerShell method would be best, if possible; Perl, Python or any other tool would be good too.

Thank you in advance.

gsl

4 Answers


To create corruption, use the destructive write test within Hard Disk Sentinel Pro. Set it to work randomly rather than sequentially. I set it to write random patterns of bits. Just run it for one to three minutes, and you’ll see on the displayed map a whole bunch of spots all over the drive getting destroyed.

Here’s how I did some testing (I’m typing fast, so I hope I don’t leave anything out):

  1. Nearly fill an ReFS mirrored storage space with files.

  2. Enable file integrity for all the files:

    Get-ChildItem -Path 'i:*' -Recurse | Set-FileIntegrity -Enable $True -Enforce $False

We do another test later with Enforce $True, but do the false one first. You’ll see why later. Read up on Enable and Enforce.

  3. Remove one of the drives and attach it to a SATA port on a second computer.
  4. On that second computer, introduce file corruption with Hard Disk Sentinel.
  5. Remove the corrupted drive and put it back in the first computer with the storage space. You will now have a mirrored storage space where one drive is okay and one has a bunch of corrupted files.

Try a mass copy of all the files from the storage space over to some other drive.

My tests show that almost nothing gets repaired and almost nothing shows up in the event log. Maybe one or two errors and that's it. You might think perhaps not much was corrupted in the first place. Well, now set Enforce to $True and do the copy operation again. With Enforce on, the copy will stop at dozens of files with checksum errors, proving that ReFS in that case is looking at checksums.

The problem is that, again, almost nothing shows up in the log. Also, with Enforce on, I got a checksum error on the one file that had supposedly been fixed during the first test with Enforce off!
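The mass-copy test above can also be scripted, so the corrupted files are collected automatically instead of eyeballed. A minimal Python sketch, based on the behavior described above (with Enforce on, ReFS surfaces a checksum mismatch as a read error, which `shutil.copyfile` raises as `OSError`); the function name is my own:

```python
import os
import shutil

def copy_tree_collect_errors(src, dst):
    """Copy every file under src into dst, recording files whose reads fail.

    On a ReFS volume with integrity streams Enforced, a checksum mismatch
    surfaces as an OSError during the read, so the failed copies identify
    exactly which files the file system considers corrupt.
    """
    errors = []
    for root, _dirs, files in os.walk(src):
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, os.path.relpath(s, src))
            os.makedirs(os.path.dirname(d), exist_ok=True)
            try:
                shutil.copyfile(s, d)
            except OSError as e:
                errors.append((s, str(e)))
    return errors
```

Run it once with Enforce off and once with Enforce on, and compare the two error lists against the event log.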

Check these threads:

Why Use ReFS?

ReFS test with corrupt data. Does it work

Has anyone run the Data Integrity Scan on an ReFS volume?

ReFS/Storage Spaces does log a problem from time to time and people see that so they just assume it works great. Also, folks can't find a good way to create test corruption so they don't bother testing. I tested on the Windows 10 Pro for Workstations SKU and the results are terrible.

Please run some tests yourself to confirm my findings.

Mark_482
    Thank you, although this procedure is not easy to adapt in a programmatical way, it does help towards that direction. Also, I did test as per your indications and found same results as you did, i.e. ReFS is unable, at this time, to catch bit-rot and to heal it, and thus is unreliable for "corruption-sensitive" projects. – gsl Sep 11 '18 at 07:48

Sounds like you want to write to the underlying storage directly, bypassing the file system. This means writing straight to the disk/partition/volume. In Windows, this can be done by working against lower-level constructs, such as \\.\PhysicalDrive0 - you can open a "file" handle to such a device and write directly to the sectors. You might find some low-level tools that do just that.

In Linux this is somewhat easier, since you can use dd to write to any block device.

If your Windows machine is a VM, then it might be easiest to edit the VHDX file (the "hard disk") from the host machine, perhaps using a HEX editor.

It might be a bit hard to map a specific file to the on-disk sectors containing its data runs. There are several methods of detecting where the data really is, but you may resort to a simple brute-force method of writing a specific piece of seemingly-unique data and simply scanning the entire disk to find it.
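The brute-force method just described can be sketched in Python: write a file containing a marker that is unlikely to occur anywhere else on the disk, then scan the raw device for it. The marker value and the chunked search are my own illustration; a plain disk-image file behaves the same as a device for testing:

```python
def find_marker(device_path, marker, block_size=1024 * 1024):
    """Scan a raw device or disk image for marker; return its byte offset.

    Reads in chunks, keeping a len(marker)-1 byte tail so a marker that
    straddles a chunk boundary is still found. Returns -1 if absent.
    On Windows, device_path could be e.g. r'\\.\PhysicalDrive2' (requires
    Administrator rights); an ordinary file works the same for testing.
    """
    tail = b""
    offset = 0  # file position of the first byte of `tail + chunk`
    with open(device_path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                return -1
            data = tail + chunk
            i = data.find(marker)
            if i != -1:
                return offset + i
            tail = data[-(len(marker) - 1):]
            offset += len(data) - len(tail)
```

Once you have the offset, you can corrupt a byte there, bypassing the file system.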

M.A. Hanin
  • Thank you. For now the only information I could gather is using C#, which I do not know. Let me see if I can find a way with VB. – gsl Aug 30 '18 at 15:47

Just a word of warning: you need to be very, very, very, careful if you decide to enable Integrity Streams.

Short Version

Integrity Streams disables all resiliency, and will cause ReFS to delete files that have a read error.

Long Version

With Integrity Streams enabled, when a read error occurs, ReFS will delete the bad file:

  • if you have a 300 GB file (e.g. WindowsServer2012R2_prod.vhdx)
  • and ReFS detects even a single uncorrectable bit
  • it will delete the entire file

No confirmation. No warning. No appeal.

One unrecoverable read error, and all the data is gone. And if you didn't like it: you shouldn't have enabled integrity streams.

This behavior is very quietly documented, using very innocuous terminology:

Resilient File System (ReFS) overview

Key Benefits

Resiliency

  • Salvaging data - If a volume becomes corrupted and an alternate copy of the corrupted data doesn't exist, ReFS removes the corrupt data from the namespace. ReFS keeps the volume online while it handles most non-correctable corruptions, but there are rare cases that require ReFS to take the volume offline.

(emphasis mine).

Storage Spaces will log to the Windows event log that your data is now gone:

  • Source: Microsoft-Windows-ReFS
  • Event ID: 513

(Warning): The file system detected a corruption on a file. The file has been removed from the file system namespace. The name of the file is "M:\VirtualDisks\WindowsServer2012R2_Prod.vhdx".

So, "warning", we deleted a virtual server, and everything on it, because we found one bad bit.

By enabling Integrity Streams you are specifically opting in to this *un*-resilient feature.

It's possible the data isn't actually deleted

The documentation notes:

ReFS removes the corrupt data from the namespace

And that is true; you don't find the file anywhere - it's gone. If you had a 1.2 TB database file, and there was a single unrecoverable bit, your 1.2 TB of data is gone - just like deleting a file.

But the file continues to use up space in the storage space. In other words, it seems the file is actually still kept around, but it is "inaccessible".

But given that there is no documented or known way to make the file "accessible" again (i.e. "undelete" it), the result is the same - your data is deleted.

So just be aware

A fundamental design goal of integrity streams is:

  • we would rather delete your data
  • than expose a read error

If you enable integrity streams: then you are agreeing that you're in a case where you would rather data be deleted than risk returning partial data.

I can't think of any situation, anywhere, in any industry, in any part of the world, where someone would want their "resilient" filesystem to intentionally delete EVERYTHING in the case of one read error.

But that's what you're asking for.

I guess it somewhat makes sense:

  • "i valid integrity"
  • "over resiliency"

How to simulate it

Run HxD as an Administrator, open the \\.\PhysicalDriveX device for writing:


flip one bit, and save.


I'm not going to actually do it for this demo, because on my 3-way mirror I have integrity streams enabled, and I don't want to lose all my data.
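A programmatic equivalent of the hex-editor bit flip can be sketched in Python. Raw-device I/O on Windows must be sector-aligned, so the sketch reads the whole sector containing the target byte, flips one bit, and writes the sector back. The 512-byte sector size and the device path are assumptions (some disks use 4096-byte sectors), and the same code runs safely against an ordinary disk-image file, which is how to test it:

```python
def flip_bit(path, byte_offset, bit=0, sector_size=512):
    """Flip a single bit at byte_offset, behind the file system's back.

    path can be a raw device such as r'\\.\PhysicalDrive2' (run as
    Administrator, and only against a disposable test disk!) or a plain
    disk-image file. Reads and writes are aligned to sector_size, as raw
    device access on Windows requires.
    """
    base = (byte_offset // sector_size) * sector_size
    with open(path, "r+b") as f:
        f.seek(base)
        sector = bytearray(f.read(sector_size))
        sector[byte_offset - base] ^= 1 << bit
        f.seek(base)
        f.write(bytes(sector))
```

Running it a second time with the same arguments flips the bit back, which makes before/after comparisons easy.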

Update: Maybe you can opt-out of losing all your data

I was running commands to make sure I had fully disabled any accidental use of Integrity Streams anywhere, and the output of Get-FileIntegrity had something odd:

PS M:\> Get-Item . | Get-FileIntegrity

FileName  Enabled  Enforced
--------  -------  --------
M:\       False    True

What the hell could "Enforced" possibly mean? How could Integrity Streams be "enforced on", but not enabled? As usual, the documentation of Get-FileIntegrity contains no documentation.

So I tried checking the documentation of Set-FileIntegrity, and it has something!

-Enforce

Indicates whether to enable blocking access to a file if integrity streams indicate data corruption.

If you specify a value of $True for this parameter, the cmdlet also enables integrity for the file.

That's it!

  • That's the broken-by-default feature,
  • that will cause you complete data loss
  • in the event of the tiniest un-correctable read error

Mother forking shirt-balls. Who's the forking bench that came up with that one.

And it's on by default! That feature should absolutely not be enabled by default! The feature shouldn't even exist!

So now I'm running a command to recursively set the options:

  • integrity streams: disabled
  • silently delete all your data: disabled

Commands

First is the command to find any files with integrity Enabled:

PS M:\> Get-ChildItem -Recurse | Get-FileIntegrity | Where {$_.Enabled -EQ $true}

FileName                                    Enabled Enforced
--------                                    ------- --------
M:\Folder1\Folder 2\The Video File 102.mkv  True    True
M:\Folder1\Folder 2\The Video File 101.mkv  True    True
M:\Folder1\Folder 2\The Video File 103.mkv  True    True
M:\Folder1\Folder 2\The Video File 104.mkv  True    True
M:\Folder1\Folder 2\The Video File 108.mkv  True    True
M:\Folder1\Folder 2\The Video File 109.mkv  True    True
M:\Folder1\Folder 2\The Video File 105.mkv  True    True
M:\Folder1\Folder 2\The Video File 111.mkv  True    True
M:\Folder1\Folder 2\The Video File 106.mkv  True    True
M:\Folder1\Folder 2\The Video File 107.mkv  True    True
M:\Folder1\Folder 2\The Video File 112.mkv  True    True
M:\Folder1\Folder 2\The Video File 110.mkv  True    True

So, yes, I had some data at risk. Now we want to turn it off:

PS M:\> Get-ChildItem -Recurse | Get-FileIntegrity | Where {$_.Enabled -EQ $true} | Set-FileIntegrity -Enable $False

Bonus Reading

Ian Boyd
  • Thank you so much for this. Truly a lifesaver. You mention that, notwithstanding this flaw, you still use integrity streams on one of your machines. Does that mean it has some values for you after all? Or you rely on backups, in case something goes astray? – gsl Nov 17 '19 at 19:20
    I'm in the process now of turning off integrity streams. `Get-ChildItem -Recurse 'M:\*' | Set-FileIntegrity -Enable $False -Enforce $False`. But i did find an interesting tidbit in `Set-FileIntegrity`: you can turn off the feature that blocks your data! – Ian Boyd Nov 17 '19 at 21:13
  • Thank you, good news indeed! When you'll have more data, would you feel like updating your answer, including details like corrupt data blocking switch syntax? – gsl Nov 18 '19 at 10:43

I just want to note that I tested this behavior on the latest Windows 11, and here are my observations:

1 - If you have mirrored storage spaces and data corruption in a file on a single disk, ReFS will fix it automatically for you with integrity streams enabled (see image 1).

2 - If the corruption is on both disks, ReFS will block access to the file, but it will not delete it. You can regain access to the file by manually changing the Enforce flag to false (e.g. Get-Item 'E:\claclaboth.txt' | Set-FileIntegrity -Enable $True -Enforce $False) (see image 2).

3 - Periodic scrubbing from the Task Scheduler (under Windows -> Data Integrity Check and Scan) still seems to do nothing on Windows 11, but the Data Integrity Scan task does seem to scrub the disk properly. You just need to add a trigger to schedule it; otherwise it will never run.

Based on the findings above, I think you can leave the Enforce flag at its default, as ReFS will not delete your file, just block access to it.

Performance issues aside, if one wants ZFS-like resilience but needs to stay on Windows, ReFS seems a lot more mature now.

cfelicio