
I'm experimenting with FSCTL_MOVE_FILE. For the most part, everything is working as expected. However, sometimes when I try to re-read (via FSCTL_GET_NTFS_FILE_RECORD) the Mft record of the file I just moved, I get back bad data.

Specifically, if the file record says the $ATTRIBUTE_LIST attribute is non-resident and I use my volume handle to read the data from the disk, I find that the data there is internally inconsistent (record length is greater than the actual length of data).
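In case it helps, the re-read is roughly this shape (just a sketch, not my actual code; hVolume is a volume handle like the one in the test code further down, and fileRef is the file reference number I'm tracking; the FSCTL and structures come from winioctl.h):

void ReadRecord(HANDLE hVolume, LONGLONG fileRef)
{
    NTFS_FILE_RECORD_INPUT_BUFFER in;
    in.FileReferenceNumber.QuadPart = fileRef;

    // Room for the output header plus one file record (1k on most volumes,
    // 4k to be safe).
    BYTE buffer[sizeof(NTFS_FILE_RECORD_OUTPUT_BUFFER) + 4096];

    DWORD bytes;
    if (DeviceIoControl(hVolume, FSCTL_GET_NTFS_FILE_RECORD,
                        &in, sizeof(in), buffer, sizeof(buffer), &bytes, NULL))
    {
        NTFS_FILE_RECORD_OUTPUT_BUFFER *out = (NTFS_FILE_RECORD_OUTPUT_BUFFER *)buffer;
        // out->FileRecordBuffer now holds the raw FILE record; walking its
        // attributes is where the inconsistent $ATTRIBUTE_LIST shows up.
    }
}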

As soon as I saw this happening, the cause was pretty clear: I'm reading the record before the Ntfs driver is finished writing it. Debugging supports this theory. But knowing that doesn't help me solve it. I'm using the synchronous method for the FSCTL_MOVE_FILE call, but apparently the file system can still be updating stuff in the background. Hmm.
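For reference, the move itself is just the standard synchronous DeviceIoControl call (sketch; hVolume, hFile, targetLcn and clusterCount are placeholders for my real values):

// Move 'clusterCount' clusters of the file, starting at VCN 0, to the
// clusters beginning at 'targetLcn'. DeviceIoControl doesn't return until
// the driver reports the move as done -- yet the metadata can evidently
// still be catching up afterwards.
MOVE_FILE_DATA mfd;
mfd.FileHandle = hFile;               // handle to the file being moved
mfd.StartingVcn.QuadPart = 0;         // first VCN of the extent to move
mfd.StartingLcn.QuadPart = targetLcn; // destination LCN on the volume
mfd.ClusterCount = clusterCount;      // number of clusters to relocate

DWORD bytes;
BOOL ok = DeviceIoControl(hVolume, FSCTL_MOVE_FILE,
                          &mfd, sizeof(mfd), NULL, 0, &bytes, NULL);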

In a normal file, I'd be thinking LockFileEx with a shared lock (since I'm just reading). But I'm not sure that has any meaning for volume handles? And I'm even less sure Ntfs uses this mechanism internally to ensure consistency.

Still, it seems like a place to start. But my LockFileEx call against the volume handle returns ERROR_INVALID_PARAMETER. I'm not seeing which parameter might be in error, unless it's the volume handle itself. Perhaps volumes just don't support locks? Or maybe there are special flags I'm supposed to set in CreateFile when opening the volume handle? I've tried enabling the SE_BACKUP_NAME privilege and passing FILE_FLAG_BACKUP_SEMANTICS, but the error remains unchanged.

Moving forward, I can see a few alternatives here:

  1. Figure out how to lock sections using a volume handle (and hope the Ntfs driver is doing the same). Seems dubious at this point.
  2. Figure out how to flush the metadata for the file I just moved (nb: FlushFileBuffers for the MOVE_FILE_DATA.FileHandle didn't help; see the sketch after this list. Maybe flushing the volume handle?).
  3. Is there some 'official' means for reading non-resident data that doesn't involve ReadFile against a volume handle? I didn't find one, but maybe I missed it.
  4. Wait "a bit" after moving data to let the driver complete updating everything. Yuck.
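For what it's worth, the flush attempt mentioned in item 2 was just this (sketch; mfd is the MOVE_FILE_DATA passed to FSCTL_MOVE_FILE, as in the earlier sketch):

// Flush the handle used for the move. The call succeeds, but the
// subsequent FSCTL_GET_NTFS_FILE_RECORD / ReadFile can still return an
// inconsistent $ATTRIBUTE_LIST.
if (!FlushFileBuffers(mfd.FileHandle))
{
    DWORD err = GetLastError();
}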

FWIW, here's some test code for doing LockFileEx against a volume handle. Note that you must be running as an administrator to open volume handles. I'm using J:, since that's my flash drive. The offset of 50000 was picked at random, but it should be less than the size of the flash drive.

void Lock()
{
    WCHAR path[] = L"\\\\.\\j:";

    HANDLE hRootHandle = CreateFile(path,
                             GENERIC_READ, 
                             FILE_SHARE_READ | FILE_SHARE_WRITE, 
                             NULL, 
                             OPEN_EXISTING, 
                             0, 
                             NULL);
    if (hRootHandle == INVALID_HANDLE_VALUE)
        return;

    OVERLAPPED olap;
    memset(&olap, 0, sizeof(olap));
    olap.Offset = 50000;

    // Request a shared (non-exclusive) lock on 1k of data at offset 50000.
    // Against a volume handle this fails with ERROR_INVALID_PARAMETER.
    BOOL b = LockFileEx(hRootHandle, LOCKFILE_FAIL_IMMEDIATELY, 0, 1024, 0, &olap);
    DWORD j = GetLastError();

    CloseHandle(hRootHandle);
}

The code for seeing the bad data is... rather involved. However, it is readily reproducible. When it fails, I end up trying to read variable-length $ATTRIBUTE_LIST entries that have a length of 0, which results in an infinite loop, since it looks like I never finish reading the entire buffer. I'm working around it by exiting if the length is zero, but I worry about "leftover garbage" in the buffer instead of nice clean zeros. Detecting that would be impossible, so I'm hoping for a better solution.
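The relevant part of the workaround looks something like this (sketch only; ATTR_LIST_ENTRY is my own minimal layout for an $ATTRIBUTE_LIST entry containing just the fields used here, not an SDK type):

#pragma pack(push, 1)
typedef struct {
    DWORD AttributeType;  // e.g. 0x80 for $DATA
    WORD  RecordLength;   // length of this entry, in bytes
    // name length/offset, LowestVcn, segment reference, etc. follow
} ATTR_LIST_ENTRY;
#pragma pack(pop)

void WalkAttributeList(const BYTE *buf, DWORD bufLen)
{
    DWORD offset = 0;
    while (offset < bufLen)
    {
        const ATTR_LIST_ENTRY *e = (const ATTR_LIST_ENTRY *)(buf + offset);

        // Workaround: a zero RecordLength means we've walked into the part
        // of the buffer Ntfs hasn't finished writing yet. Bail out rather
        // than spinning forever on the same offset.
        if (e->RecordLength == 0)
            break;

        // ... process the entry ...

        offset += e->RecordLength;
    }
}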

Not surprisingly, there isn't a lot of info out there on any of this. So if someone has some experience here, I could use some insight.


Edit 1:

More things that don't quite work:

  • Still no luck on LockFileEx.
  • I tried flushing the volume handle (as Paul suggested). And while this works, it more than doubles my execution time. And, strictly speaking, it still doesn't solve the problem. There's still no guarantee that Ntfs isn't going to change things some more between the FlushFileBuffers and FSCTL_GET_NTFS_FILE_RECORD / ReadFile.
  • I wondered about the 'RecordChanged' timestamp of the $STANDARD_INFORMATION attribute. However, it isn't updated by these changes to the ATTRIBUTE_LIST.
  • Fragmenting a file eventually causes an ATTRIBUTE_LIST to get added, and as fragmentation continues to increase, more DATA records get added to that list. When a DATA record gets added, the UpdateSequenceNumber (not the one that's part of the MFT_SEGMENT_REFERENCE; the other one) gets updated. Unfortunately, there's a sequence of events involved in performing this update, and apparently the ATTRIBUTE_LIST buffer length gets updated before the UpdateSequenceNumber. So checking whether the UpdateSequenceNumber has changed doesn't help avoid reading (potentially) bad information.

My next best thought is to see if perhaps Ntfs always zeros the new bytes before updating the record length (or maybe whenever the record length shrinks?). If I can depend on reading a zero record length there (instead of whatever leftover data might occupy those bytes), I can pretend to call this fixed.

David Wohlferd
  • File systems support locks only on files, not on folders or volumes – https://github.com/Microsoft/Windows-driver-samples/blob/master/filesys/fastfat/lockctrl.c#L656 or https://github.com/Microsoft/Windows-driver-samples/blob/master/filesys/cdfs/lockctrl.c#L74 or http://read.pudn.com/downloads171/sourcecode/windows/vxd/794585/ntfs/lockctrl.c__.htm – RbMm Jun 10 '18 at 09:10
  • @RbMm - Windows has the source to their drivers online? But you are right, that's what these seem to say. – David Wohlferd Jun 10 '18 at 09:22
  • For fastfat and cdfs, yes, the source code exists. For ntfs, no; only a very old leaked source. – RbMm Jun 10 '18 at 09:24
  • But I don't understand why you need to re-read via `FSCTL_GET_NTFS_FILE_RECORD`. We can use it to create an initial list of files, though I think `FSCTL_ENUM_USN_DATA` would be faster (it can return many records per call). – RbMm Jun 10 '18 at 09:35
  • `There's still no guarantee that Ntfs isn't going to change things some more between the FlushFileBuffers and FSCTL_GET_NTFS_FILE_RECORD / ReadFile` - indeed. Doing any kind of direct disk access on a mounted volume will carry this risk. But I found [this](https://msdn.microsoft.com/en-us/library/windows/desktop/aa364575%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396). That page states that you can only lock a volume if there are no files open, however, so it is of limited usefulness. Also, I failed to ask you: why do you need to grub around in the MFT in the first place? – Paul Sanders Jun 10 '18 at 17:05
  • In [MS-FSA 2.1.5.7](https://msdn.microsoft.com/en-us/library/ff469353.aspx), it defines the supported File type for a byte-range lock request to be DataStream (i.e. a regular file or named data stream). Directory and low-level volume opens do not support byte-range locking. – Eryk Sun Jun 10 '18 at 19:08
  • Also, it's a mistake to think of this in terms of handles. The API clearly tells you this is a File object (e.g. `CreateFile`, `ReadFile`, `LockFileEx`). A handle is just a reference to an object. You can't lock a handle. You lock the referenced File object. It doesn't matter which handle you use from whatever process -- as long as it refers to the same object (e.g. if you duplicate the handle into another process). – Eryk Sun Jun 10 '18 at 19:15

2 Answers


The solution to your problem does indeed seem to be to call FlushFileBuffers() with a handle to the volume. Near the bottom of the FlushFileBuffers page, MSDN has this to say:

To flush all open files on a volume, call FlushFileBuffers with a handle to the volume. The caller must have administrative privileges...

Other information on that page leads me to believe that this will also flush the metadata, although it doesn't directly say so in this specific case. Perhaps you can update me on this.
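Something along these lines should do it (a sketch only, reusing the J: volume from your test code; note the volume has to be opened with write access for the flush to succeed, and the caller needs administrative privileges):

// Flush everything pending on the volume before re-reading the record.
HANDLE hVolume = CreateFile(L"\\\\.\\j:",
                            GENERIC_READ | GENERIC_WRITE,
                            FILE_SHARE_READ | FILE_SHARE_WRITE,
                            NULL, OPEN_EXISTING, 0, NULL);

if (hVolume != INVALID_HANDLE_VALUE)
{
    if (!FlushFileBuffers(hVolume))
    {
        DWORD err = GetLastError();  // e.g. ERROR_ACCESS_DENIED without write access
    }
    CloseHandle(hVolume);
}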

To step back from the detail and look at the bigger picture for a moment, there has to be an API for this somewhere, for all sorts of reasons, although I suppose it might not be public.

Paul Sanders
  • I'll give it a shot. I considered this, but I'm a bit cautious about it. Flushing a volume handle seems fraught: "To flush all open files on a volume, call FlushFileBuffers" - Do you interpret this "all" to mean even ones open in other applications? Ugh. And of course I'll have to open the handle with 'Write' privs. Still, it's just a flash drive. What could go wrong? – David Wohlferd Jun 10 '18 at 04:40
  • It could burst into flames :) But why ugh? Seems pretty harmless to me. And yes of course, everything will get flushed, but I don't think that matters much as it will be transparent to any other processes with files open, apart from the time taken for the disk writes themselves to complete. I guess _that's_ a hit you will just have to take. – Paul Sanders Jun 10 '18 at 04:50
  • As I type this, my (mostly idle) computer has 130 tasks listed in task manager, each of which has an unknown number of files open. Since I'm in a loop, flushing that many files a dozen times a second is not something I'm excited about. Yes, today I'm using a flash drive, but I have larger ambitions. And while you're *probably* right that it won't actually hurt anything, it seems unlikely that has been stress tested very often. I think your hopes for a public api are optimistic. I'm glad my explanation was clear enough to convey the need for one, but other than weirdos, who needs it? – David Wohlferd Jun 10 '18 at 05:02
  • Flushing the volume probably just consults the disk cache and writes out anything pending. I doubt it has anything to do with files at all, per se, and it's probably a very cheap operation if it turns out there's nothing to do. Probably would have helped if I'd said that in the first place. Anyway, try it, and please accept / vote up if it works for you. – Paul Sanders Jun 10 '18 at 05:10
  • Without the flush, my test program (which fragments then defragments a file) ran in just over 11 minutes. Adding the flush changed that to 28 minutes. That's the bad news. The good news is that I didn't get any corrupted records. Whether that's because flushing solved the problem or that it's just running so slow it's hiding it isn't clear. I guess there's always the potential for Ntfs to be mid-update on a file. It's just in my face because of how I'm using it. Apparently other people who need this data just live with it. Have an upvote now. I'll accept if nothing better is suggested. – David Wohlferd Jun 10 '18 at 05:30
  • Interesting, thank you for sharing. I think you solved the problem, but the performance hit is a bit disappointing. That's probably because the explicit flush is interfering with Windows' usual 'lazy' write-back mechanism, which happens less frequently and can therefore be made more efficient. I would look at your logic with a view to calling `FlushFileBuffers` as infrequently as possible. If you can manage that, it should run faster. – Paul Sanders Jun 10 '18 at 06:55
  • An interesting side effect: Ntfs records have a field for `UpdateSequenceNumber`. This is distinct from the `SeqNum` that is part of the MFT_SEGMENT_REFERENCE. When I don't flush, this value grows slowly from 2->13 over the course of my fully fragment/defragment of a file. Add flushing and this ranges from 2->336 during frag, up to 629 post defrag. Apparently Ntfs is improving performance by not computing the new metadata until it must. Flushing means it must. I'm trying to think of a way to make use of this. 336 is ~ the number of 0x80 records that get added during my ~13,000 moves. – David Wohlferd Jun 10 '18 at 07:00
  • I'm sure you're on to something there but I'm less sure that would account for 17 minutes of additional wall clock time. My money is still on the cost of those probable extra writes to disk. You could use SysInternals' Process Explorer or the Process Monitor app built into Windows to investigate further. Previous comment deleted, it was a bit silly. – Paul Sanders Jun 10 '18 at 17:01

I think I've got it.

To reiterate the goal:

After using FSCTL_GET_NTFS_FILE_RECORD to read a record from the Mft, I kept finding that the ATTRIBUTE_LIST record was in an 'inconsistent state', such that the reported record length was greater than the actual amount of data in the record. Reading data beyond what had been written seemed risky, as I couldn't be sure whether what I read was valid or leftover garbage.

To that end, I suggested 4 alternatives that I hoped would let me work around this.

  1. Using LockFileEx on the volume (which seemed like the best answer when I began) turns out to be a complete non-starter. RbMm & eryksun (as well as my own experimentation) provide some pretty compelling evidence this just won't work. As the 'File' in LockFileEx implies, this function only works on files.
  2. Flushing the volume handle makes the symptoms go away. But at a huge (> 100%) penalty in performance. It's also not clear whether the problem is actually solved, or merely hidden behind the slowdown this causes.
  3. The idea of 'some other' api to read non-resident data seems mythical.
  4. Waiting some (unspecified) amount of time after doing a FSCTL_MOVE_FILE is not a plan, it's a hope.

For a brief time, it looked like checking the UpdateSequenceNumber in the NtfsRecord might provide a solution. However, the order of events Ntfs uses when updating a record means that the record length of the ATTRIBUTE_LIST gets updated (well) before the UpdateSequenceNumber.

But then I began thinking about exactly when this might be a problem. If I ignore it, where will it fail?

Currently I am experiencing the problem as the ATTRIBUTE_LIST is growing (since I am deliberately and massively fragmenting a file). And at that time, it's easily detectable due to the zero record length. I've run the program a number of times, and while it's just anecdotal, the extra space as the record grows has always been zeroed. This makes sense, as you'd zero out the entire buffer when you first allocate it. Both standard programming practice and observation support this conclusion.

But what about when the record starts to shrink? Or shrinks and then grows? Could you end up with leftover data there instead of the (easily interpreted) zeros?

Then it hit me: The ATTRIBUTE_LIST never shrinks. I was just complaining about this a few weeks ago. Even when you completely defragment the file and all these extra DATA records are no longer required, Ntfs doesn't compact them. And now for the first time I have a glimpse of why that might be. There's a possibility this might change in W10, but that might just be an overly optimistic interpretation of an undocumented function.

So, I don't need to worry about reading garbage data (possibly including a meaningless record length causing me to overrun the buffer). The record length in the ATTRIBUTE_LIST can be trusted. There is just the possibility that the last record might have a zero record length.

I can either ignore the zero length record (essentially returning the pre-growth information) or reread the record until the UpdateSequenceNumber changes (indicating that the update is complete).
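In code, the second option might look something like this (sketch; GetUsn() and ReadMftRecord() are hypothetical helpers standing in for however the record header and the FSCTL_GET_NTFS_FILE_RECORD call are wrapped):

// Re-read the record until its UpdateSequenceNumber changes, which
// indicates Ntfs has finished the update that produced the zero-length
// entry we just saw.
WORD usnBefore = GetUsn(record);        // USN from the inconsistent read
do
{
    Sleep(0);                           // yield rather than spin flat out
    ReadMftRecord(hVolume, fileRef, record);
} while (GetUsn(record) == usnBefore);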

Tada.

David Wohlferd