
When you use a function like fopen(), you have to pass it a string argument for the filename. I want to know what the character encoding of this string should be.

This question has already been asked here, but it has contradictory answers. One answer says the following:

It depends on the system locale. Look at the output of the "locale" command. If the variables end in UTF-8, then your locale is UTF-8. Most modern linuxes will be using UTF-8. Although Andrew is correct that technically it's just a byte string, if you don't match the system locale some programs may not work correctly and it will be impossible to get correct user input, etc. It's best to stick with UTF-8.

While another answer says the following:

Filesystem calls on Linux are encoding-agnostic, i.e. they do not (need to) know about the particular encoding. As far as they are concerned, the byte-string pointed to by the filename argument is passed down to the filesystem as-is. The filesystem expects that filenames are in the correct encoding (usually UTF-8, as mentioned by Matthew Talbert).

This means that you often don't need to do anything (filenames are treated as opaque byte-strings), but it really depends on where you receive the filename from, and whether you need to manipulate the filename in any way.

Which answer is the correct one?

Steve
  • The first answer you quote is plainly wrong. It does not depend on the "system locale". There is no such thing as system locale to begin with. There are only user locales. – n. m. could be an AI Jun 11 '18 at 05:42

1 Answer


They're both correct in some ways.

The strings passed to the file system calls are a string of bytes, with a null byte marking the end of the string and '/' used to separate path components. Within the file name segments, the meaning of the bytes is immaterial to the file system — they're just a sequence of bytes.
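
To illustrate, here is a minimal sketch, assuming a Linux system with a typical native filesystem such as ext4 or tmpfs. Python is used only for brevity: when its `os` functions are given `bytes` paths on POSIX systems, they hand those bytes to the underlying system calls without re-encoding them. The helper `roundtrip_raw_name` and the sample names are made up for this example. One of the names is deliberately not valid UTF-8.

```python
import os
import tempfile


def roundtrip_raw_name(raw_name):
    """Create a file named exactly raw_name (bytes) and list its directory as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        directory = os.fsencode(tmp)            # work with bytes paths throughout
        path = os.path.join(directory, raw_name)
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        os.close(fd)
        # A bytes argument makes listdir() return the stored names as raw bytes.
        return os.listdir(directory)


# The same visible name ("café.txt") in two encodings gives two different byte strings.
latin1_name = "caf\u00e9.txt".encode("iso-8859-1")  # contains byte 0xE9, invalid as UTF-8
utf8_name = "caf\u00e9.txt".encode("utf-8")         # contains bytes 0xC3 0xA9

print(roundtrip_raw_name(latin1_name))
print(roundtrip_raw_name(utf8_name))
```

Under those assumptions both names should come back byte for byte as they were passed in, regardless of the locale in effect. Filesystems on other platforms may behave differently, for example by rejecting or normalising names that are not valid UTF-8.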

How the bytes that form the file name are displayed depends on the equipment used to display them. If the names use UTF-8 with non-ASCII characters, printing that data using ISO 8859-15 (or 8859-1 for intransigent residents of the USA) yields gibberish, often including C1 control bytes from the byte range 0x80 .. 0x9F. If the names use 8859-15 with non-ASCII characters, there will be sequences that are not valid UTF-8 and you will get illegible or meaningless data displayed (question marks, or other indications of invalid UTF-8 sequences).
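
The display mismatch can be reproduced without touching the filesystem at all. The following sketch (Python again, for brevity; `show_as` and the sample names are made up for this example) decodes the same file-name bytes under the "wrong" encoding in both directions. It prints escaped forms via `ascii()` so that the result does not depend on the terminal's own encoding.

```python
def show_as(raw, encoding):
    """Interpret raw file-name bytes the way a display using `encoding` would."""
    # errors="replace" substitutes U+FFFD for invalid sequences instead of raising.
    return raw.decode(encoding, errors="replace")


name = "caf\u00e9.txt"                      # "café.txt"
utf8_bytes = name.encode("utf-8")
latin9_bytes = name.encode("iso-8859-15")

# UTF-8 bytes shown as ISO 8859-15: each non-ASCII character becomes two characters.
print(ascii(show_as(utf8_bytes, "iso-8859-15")))

# ISO 8859-15 bytes shown as UTF-8: the lone 0xE9 byte is an invalid sequence.
print(ascii(show_as(latin9_bytes, "utf-8")))

# "É" is 0xC3 0x89 in UTF-8; read as ISO 8859-15, the second byte is a C1 control.
print(ascii(show_as("\u00c9".encode("utf-8"), "iso-8859-15")))
```

In each case the bytes themselves are unchanged. Only the interpretation applied when displaying them differs.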

Jonathan Leffler
  • The exception is filesystems that store names in 16-bit characters; then the mount option `iocharset` tells how the 16-bit characters are converted to 8-bit character sequences, and vice versa. (This affects long filenames on VFAT, for example. NTFS uses `nls` mount option for the same instead, because nothing MS does is standard if it can avoid it, apparently.) – Nominal Animal Jun 11 '18 at 05:53
  • @NominalAnimal: how exactly is MS involved in the naming of Linux file system mount options? Neither the VFAT driver nor the NTFS3G one comes from Microsoft. This is more the case of "on Linux no two people do the same thing in a compatible way across implementations" - just ask the poor FreeDesktop guys, who try to put some order in the DE mess. – Matteo Italia Jun 11 '18 at 06:16
  • @MatteoItalia: No, you are wrong. NTFS initially used the `iocharset` mount option, but it was soon deprecated in favor of the `nls` option. The Linux NTFS driver *could* use both options, but the code specifically checks for `iocharset` to display a deprecation error. The only reason to use `nls=` instead of `iocharset=`, as all other filesystems do, is to appease Microsoft, who use "NLS" and "codepage" rather than character set in their documentation. NTFS devs seem to have been more interested in working with Microsoft than Linux: see e.g. [Tuxera](https://en.wikipedia.org/wiki/Tuxera) history. – Nominal Animal Jun 11 '18 at 06:56
  • (If it is unclear to anyone, Anton Altaparmakov is the Linux NTFS fs driver maintainer, and [Tuxera.com is marked as the maintainer website](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/MAINTAINERS#n10135). Good luck trying to get any patch through that would even *allow* the use of iocharset mount option, to unify it across fs'es. Won't happen; will be nixed faster than you can write the five-line patch for it. Try it, please.) – Nominal Animal Jun 11 '18 at 07:05
  • It's actually interesting: the option has been nominally deprecated since before 2005 (that's as far as the git history for the incriminated lines goes), but today, 13+ years later, it still works fine and just spits out an annoying message; they weren't that serious about that deprecation. Removing the annoying message would indeed be a five-line patch. I'm surprised nobody bothered; I'll try to ask on LKML. I'm still not convinced by the conspiracy theory: I've seen enough petty quarrels about "the right thing" to think that we are more than capable of self-inflicting damage of this kind. – Matteo Italia Jun 11 '18 at 07:29
  • @NominalAnimal: that being said, this is all extremely off-topic and I think a moderator will come here to kill all those comments, which will be lost like tears in rain; anyway, thank you for letting me know of yet another little sad story of my favorite kernel. :-( – Matteo Italia Jun 11 '18 at 07:32