
I'm trying to list all the files and folders on a mounted NTFS volume. So far I have implemented two ways to do it, and unfortunately they yield different results.

(NOTE: I couldn't include additional sources here because of the link limit.)

There are a few things I would like cleared up:

(1) Why do certain files/folders have unusual characters in the middle of their names? How do I write them to a wstringstream, and how would I then properly write them to a wofstream?

Example file path: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https∺∯∯wscont.apps.microsoft.com∯winstore∯6.3.0.1∯100∯US∯en-us∯MS∯482∯features1908650c-22a4-485e-8e88-b12d01c84f2f.json.dat

How it appears if you were to use dir in cmd: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https???wscont.apps.microsoft.com?winstore?6.3.0.1?100?US?en-us?MS?482?features1908650c-22a4-485e-8e88-b12d01c84f2f.json.dat

How it appears if you were to use wprintf in C++: C:\Users\Rahul\AppData\Local\Packages\winstore_cw5n1h2txyewy\LocalState\Cache\4\4-https

The file name displays properly in Windows Explorer, but cmd has trouble printing it. In Notepad++ it appears as a box, but if you right-click it shows properly, so Notepad++ can also display the characters (sort of; maybe an encoding change?).

I'm currently using the following (ss is the string stream, initialized as wstringstream ss("");):

wstringstream ss("");
(my program methods here)
wofstream out("...", wofstream::out);
out << ss.rdbuf();
out.close();

I'm assuming that the encoding has at least something to do with it, but at the same time, I'm not sure which flags to use.

(2) Are all files listed in the MFT? Every source I have found on NTFS says that all file information and attributes are stored in the MFT, but according to the open source NTFSLib (link limit again; it can be found by googling An-NTFS-Parser-Lib), there are 131840 file records.

When I run my own program, I end up with a 50MB file (it includes permissions and the like). My program uses FSCTL_ENUM_USN_DATA to enumerate records, CreateFile to obtain handles, and GetFileInformationByHandle to get extended information. CreateFile takes the WCHAR* normally and doesn't seem to have the null termination issue (I think; I'm not sure anymore, and this might be where the missing files are).

It shows that there are 129454 files that it could read. I'm assuming that the other 131840 - 129454 = 2386 files were deleted but are still in the USN journal.

(3) How come my Java version of the code outputs more file records than the MFT even contains?

The output of my Java code is a 150MB file (includes permissions, enumerates with names instead of symbols because I don't know how to not do that, so it's way bigger).

As you can see here, there are 161430 file records in this one. That's more than what NTFSLib said there are. Yes, it is the case that probably many of those 131840 file records are 'additional names', but I explicitly avoided symlinks in my Java version. Is it the case that those extra 30000 files are generated from hardlinks or somehow having more names is independent from being symlinks?

Rahul Manne
  • `How it appears if you were to use wprintf in C++: ` More than likely, `wprintf` stops at the first null byte, so you don't get the entire string. – PaulMcKenzie Aug 18 '14 at 21:42
  • NTFS uses unicode (16 bit characters) for file names and directories. In addition some programs, like securerom, use somewhat garbled file names to make them difficult to modify or delete. There are also [reparse points](http://en.wikipedia.org/wiki/NTFS_reparse_point) with stuff like [junction points](http://en.wikipedia.org/wiki/NTFS_junction_point) that "fake" old directory paths like "documents and settings". Windows FindFirstFile and FindNextFile should work, but you may have to set the process token permissions to something like "backup" in order to access all directories. – rcgldr Aug 18 '14 at 21:46
  • And in the case of URLs in file paths, `:` and `/` are illegal characters, so they have to be replaced, and `∺` (U+223A GEOMETRIC PROPORTION) and `∯` (U+222F SURFACE INTEGRAL) happened to be the replacements chosen by whatever app created those files. – Remy Lebeau Aug 18 '14 at 21:50
  • @PaulMcKenzie But why would there be a null byte in the middle of a string?! And how would I deal with it? I mean, Java's native Files.walkFileTree(...) seems to be pretty good at it, and so is windows explorer. How would I make it print properly? – Rahul Manne Aug 18 '14 at 21:50
  • @RahulManne: the strings you showed do not have nulls in them, so either `wprintf()` decided to stop on the first UTF-16 surrogate it could not handle (which happened to be the `∺` character), or the console display itself stopped outputting when it hit that character. – Remy Lebeau Aug 18 '14 at 21:54
  • @RemyLebeau Yeah, I think your first guess is right. Is there any alternative to wofstream that can print these utf-16 characters? – Rahul Manne Aug 18 '14 at 21:56
  • @RahulManne - You never mentioned if your application is a UNICODE or MBCS application. If it is MBCS, then wsprintf is interpreting that string as 8-bit characters, and terminates on the first `byte` (not character) that's null. This is different than Java, where characters are 16-bits, regardless. – PaulMcKenzie Aug 18 '14 at 22:06
  • @PaulMcKenzie: you are assuming `wsprintf()` (ie the Win32 API function - note the `s`). `wprintf()` (ie the C runtime function - note the lack of `s`) does not suffer from that problem. And as I said in my earlier comment, there are no nulls in the string that Rahul showed anyway. – Remy Lebeau Aug 18 '14 at 22:07

1 Answer


Solution to (1):

What worked for me was writing my own small library that writes UTF-16 directly. With the stream-based approach, the output can end up misaligned, and a byte inside a character is then misread as a null. For example, 0xD00A may be read as containing a 0x00 byte after a misalignment, which terminates the output.

I used the following two files to write the output as Unicode. They handle wchar_t, wchar_t*, char, char*, unsigned long, and unsigned long long: UTF16.h, UTF16.c
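The two files themselves are not reproduced here, so the following is only a minimal sketch of the general idea, not their actual contents: encode each wide character as little-endian UTF-16 by hand and write the bytes through a binary stream. The names `to_utf16le` and `write_utf16le_file` are made up for illustration. Opening the file in binary mode matters on Windows, because text mode translates every 0x0A byte into 0x0D 0x0A, which shifts all following bytes by one and misaligns the 16-bit units.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: encode a wide string as UTF-16LE bytes with a BOM.
// Works whether wchar_t is 16 bits (Windows) or 32 bits (most Unix systems).
std::string to_utf16le(const std::wstring& text) {
    std::string bytes = "\xFF\xFE";  // byte-order mark for UTF-16LE
    auto put_unit = [&bytes](std::uint16_t unit) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };
    for (wchar_t wc : text) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        assert(cp <= 0x10FFFF);  // outside this range is not a Unicode code point
        if (cp > 0xFFFF) {
            // Only reachable with 32-bit wchar_t: split into a surrogate pair.
            cp -= 0x10000;
            put_unit(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            put_unit(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // With 16-bit wchar_t the units are already UTF-16; copy them as-is.
            put_unit(static_cast<std::uint16_t>(cp));
        }
    }
    return bytes;
}

// Hypothetical helper: write the encoded bytes to a file in binary mode,
// so the runtime performs no newline translation on the raw bytes.
bool write_utf16le_file(const std::string& path, const std::wstring& text) {
    std::ofstream out(path, std::ios::binary);
    const std::string bytes = to_utf16le(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With this approach the contents of the `wstringstream` from the question would be passed as `ss.str()` instead of streaming `ss.rdbuf()` into a `wofstream`.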

(2,3):

Yes, they're all there. GetFileInformationByHandle reports the number of links for each file, and adding up those link counts gives the number of files that the Java version lists.
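On Windows the link count is the `nNumberOfLinks` field of the `BY_HANDLE_FILE_INFORMATION` structure that GetFileInformationByHandle fills in. To keep the example portable, the sketch below swaps in `std::filesystem::hard_link_count` from C++17 instead of the Win32 call. It is only an illustration of the counting argument: the function names and the scratch directory layout are made up, and it assumes the file system supports hard links.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <vector>

namespace fs = std::filesystem;

// Sum the hard-link counts of the given files. If the list holds exactly one
// path per file record, the sum is the number of names those records have.
std::uintmax_t total_names(const std::vector<fs::path>& one_path_per_record) {
    std::uintmax_t total = 0;
    for (const fs::path& p : one_path_per_record) {
        total += fs::hard_link_count(p);  // throws fs::filesystem_error on failure
    }
    return total;
}

// Hypothetical demo: two file records, one of which has a second name.
std::uintmax_t demo_total_names(const fs::path& scratch_dir) {
    fs::remove_all(scratch_dir);  // start from a clean directory
    fs::create_directories(scratch_dir);
    const fs::path a = scratch_dir / "a.txt";
    const fs::path c = scratch_dir / "c.txt";
    std::ofstream(a) << "first record";
    std::ofstream(c) << "second record";
    fs::create_hard_link(a, scratch_dir / "b.txt");  // second name for a.txt
    assert(fs::hard_link_count(a) == 2);
    const std::uintmax_t total = total_names({a, c});
    fs::remove_all(scratch_dir);  // clean up the scratch files
    return total;
}
```

A directory walk such as the Java one visits `a.txt`, `b.txt`, and `c.txt` as three separate entries, while a record-based enumeration sees only two files. Summing the link counts over the records closes that gap.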

Still looking for: How do you list the names of all the links to the file record in the MFT?

Rahul Manne