The first example that comes to mind: if you create a file on OS X named é (the single code point U+00E9), the OS will actually store it as U+0065 U+0301 (Unicode decomposition). The file is still accessible under the original name, but it is listed in its decomposed form.
How to avoid: don't look up your files manually (for example, by comparing names from a directory listing yourself) unless you are sure their names are pure ASCII.
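If you do have to compare names yourself, one option is to normalize both sides to the same Unicode form first. The following is only a minimal sketch in Python; the helper name same_filename is made up for illustration, and it assumes that plain NFC normalization is good enough for your names (the file system's own decomposition rules may differ in corner cases).

```python
import unicodedata

def same_filename(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"      # é as the single code point U+00E9
decomposed = "e\u0301"   # U+0065 followed by the combining acute accent U+0301

# A plain string comparison treats the two spellings as different names.
naive_equal = composed == decomposed

# Decomposing the composed form yields the two-code-point spelling.
nfd_of_composed = unicodedata.normalize("NFD", composed)

# After normalization the two spellings compare equal.
normalized_equal = same_filename(composed, decomposed)

print(naive_equal, nfd_of_composed == decomposed, normalized_equal)
```

Normalizing at the comparison site means it does not matter which form the directory listing hands back.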
Second: on Windows, if you have a file called e and you create (with overwriting enabled) a file called E, the OS will still list a file called e. If e didn't exist beforehand, a file called E would be created.
How to avoid: don't look up your files manually unless you are sure their names are pure ASCII, and take case into account. Use a consistent capitalisation style; I suggest going all lowercase.
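To take case into account when matching names, you can compare case-folded strings. This is a rough sketch in Python with a made-up helper name, match_ignoring_case; note that str.casefold() is an assumption here and is not guaranteed to agree with the file system's own case table for every character.

```python
from typing import Iterable, Optional

def match_ignoring_case(names: Iterable[str], wanted: str) -> Optional[str]:
    """Return the first entry of `names` equal to `wanted` ignoring case, else None."""
    target = wanted.casefold()
    for name in names:
        if name.casefold() == target:
            return name
    return None

# Names as a directory listing might report them (sample data for illustration).
listing = ["e", "Readme.TXT"]

found = match_ignoring_case(listing, "E")            # the listed spelling, "e"
missing = match_ignoring_case(listing, "other.txt")  # None, nothing matches

print(found, missing)
```

In real use you would pass something like os.listdir(directory) as the first argument and then open the file under the name that was returned.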
Third: on Windows, if for example you have Windows-1250 as your system encoding and you try to create a file named ê via the narrow, char-based API, a file called e will be created instead. This is of course easy to avoid, but this exact problem bit me once: WinRAR extracted the files ê.png, è.png and e.png all into e.png, overwriting data. Similar problems can happen with other encoding mix-ups, too.
How to avoid: don't use APIs that take the filename as a char* on Windows.
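If you are stuck with a narrow API anyway, you can at least detect names that the code page cannot represent before anything gets silently replaced. A minimal sketch in Python, using its cp1250 codec as a stand-in for the system encoding; the helper name representable_in is hypothetical, and Python's strict encoder raises an error where the scenario above substitutes a look-alike character, so this is a pre-check and not a reproduction of that behaviour.

```python
def representable_in(name: str, codepage: str = "cp1250") -> bool:
    """Return True if every character of `name` exists in `codepage`."""
    try:
        name.encode(codepage)  # strict mode: raises on unmappable characters
        return True
    except UnicodeEncodeError:
        return False

# The three archive members from the example above.
names = ["\u00ea.png", "\u00e8.png", "e.png"]  # ê.png, è.png, e.png

# Names that would not survive a round trip through the code page.
lossy = [n for n in names if not representable_in(n)]

print(lossy)
```

Here ê.png and è.png are flagged, which are exactly the two names that collided with e.png.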