When and how might the operating system store a file under a different name than I gave it?

Question

I found this statement under another SO question concerning Unicode and I'd like to ask for further elaboration of this rather surprising fact.

Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you'll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!

When does this happen, and what can I do about it?

score 1 · Accepted Answer · answered Dec 07 '14 at 22:29

The first example that comes to mind: if you create a file under OS X that is named é (the single code point U+00E9), the OS will actually store it as U+0065 U+0301 (the Unicode decomposition). The file will still be accessible under the original name, but it will be listed under the decomposed one.

How to avoid: don't look up your files manually (for example, by comparing names against a directory listing) unless you are sure their names are pure ASCII.

Second: on Windows, if you have a file called e and try to create (with overwriting enabled) a file called E, the OS will still list a file called e. If e didn't exist beforehand, a file called E would be created.

How to avoid: don't look up your files manually unless you are sure their names are pure ASCII, and take case into account. Try using a consistent capitalisation style. I suggest going all lowercase.

Third: on Windows, if, for example, Windows-1250 is your system encoding and you try to create a file named ê via the narrow, char-based API, a file called e will be created instead. This is of course easy to avoid, but this exact problem bit me once: WinRAR extracted the files ê.png, è.png and e.png all into e.png, overwriting data. Similar problems can happen with other encoding mix-ups, too.

How to avoid: don't use APIs that take the filename as a char* on Windows.
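If you cannot control which API a tool uses, you can at least detect names that are at risk before handing them over. The sketch below checks whether a name can be encoded in a given code page at all; `representable` is a hypothetical helper, `cp1250` is Python's codec name for Windows-1250, and the check says nothing about what the OS would substitute for an unrepresentable character.

```python
def representable(name: str, codepage: str = "cp1250") -> bool:
    """Return True if every character of `name` exists in `codepage`."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


names = ["ê.png", "è.png", "e.png"]

# Names that a narrow, code-page-based API could not pass through unchanged.
at_risk = [n for n in names if not representable(n)]
```

Neither ê nor è has a code point in Windows-1250, so `at_risk` contains `ê.png` and `è.png`, while `e.png` passes the check.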
