12

I know the short answer should be "nowhere", but something doesn't quite add up in Test 2 below.

Test 1. In Gedit, I create a new file containing only the string "aàbï", I choose "Save As" and there's a selector for choosing the character encoding. So I save it as "Unicode (UTF-8)", then I repeat the same and I save it to another file as "ISO-8859-15". The first file is 7 bytes in size (2 1-byte characters, 2 2-byte characters and a LF at the end of the file, as a hex dump shows). The second file is 5 bytes in size (4 1-byte characters in latin encoding plus a LF). This shows that the encoding is not stored anywhere in the file. Apparently, when I open the file in Gedit and it decodes it correctly, it must be figuring out how to decode it by analyzing the contents.
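
The byte counts in Test 1 can be reproduced outside the editor. This is only a minimal sketch: it assumes the file holds the four characters followed by a single LF, and that Python's `iso8859-15` codec matches what Gedit writes for "ISO-8859-15".

```python
# Reproduce Test 1: the same text encoded two ways gives different bytes.
text = "aàbï\n"  # four characters plus the trailing LF

utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso8859-15")

# Print the size and a hex dump of each encoding.
print(len(utf8_bytes), utf8_bytes.hex(" "))      # 7 bytes: à and ï take two bytes each
print(len(latin9_bytes), latin9_bytes.hex(" "))  # 5 bytes: one byte per character
```

Neither byte sequence carries a marker naming its encoding; the only difference is how the two accented characters are represented.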

Test 2. I do the same as above, but this time the contents of the file are just "abcd", that is, four ASCII characters. The two saved files have identical sizes (5 bytes) and identical hex dumps. The two files appear to be identical and indistinguishable, so, again, it seems no information about the encoding is included in the files.
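
The same check for Test 2, again as a sketch under the assumption that the file is just the four letters plus one LF:

```python
# Reproduce Test 2: pure-ASCII text encodes to the same bytes in both encodings.
text = "abcd\n"

utf8_bytes = text.encode("utf-8")
latin9_bytes = text.encode("iso8859-15")
ascii_bytes = text.encode("ascii")

# All three encodings agree byte for byte on ASCII-only text.
identical = utf8_bytes == latin9_bytes == ascii_bytes
print(identical, len(utf8_bytes), utf8_bytes.hex(" "))
```

Since the bytes are equal, nothing inside either file can tell a program which encoding was chosen at save time.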

However, when I open the two files of Test 2 again in Gedit and go to Save As, the encoding that each file was saved with is selected. Gedit can somehow tell that one file was encoded in UTF-8 and the other in ISO-8859-15, even though both contain only ASCII characters that result in the same byte sequence, and the files appear to be identical. How is that?

Is there some sort of metadata in the filesystem? Or is it just Gedit that has its own cache and remembers user choices for a given file that was already opened (and saved) with it on the same computer?

P.S. Note that this question is related to programming even though I pose a non-programming test case, because it is about how a given type of file is encoded, which affects how one would read, parse, decode, encode and write such files from a program.

matteo
  • 2,934
  • 7
  • 44
  • 59
  • 2
    Probably that editor caches the encoding by file name, so that would be volatile and proprietary information, sometimes even misleading. The character encoding of a plain text file is definitely _not_ stored anywhere. Actually the two files of your second example do not really have a different encoding. They simply contain two 4-character sequences of 7-bit characters. Such strings are valid in most encodings. – arkascha Mar 27 '16 at 20:13
  • For the benefit of future readers coming across this question: I've seen UTF-8 files in the wild that started with a byte order mark (U+FEFF), and software that used that as a hint that the content is some variation of Unicode. – Ulrich Schwarz May 16 '16 at 08:45
  • 1
    I have come across this same weird problem using xed - exactly the same thing. I have compared the files with multiple compare programs, some do binary compare, and all report the two files are byte for byte identical. I have changed the names of the files and copied them into a different directory, and xed still refuses to open one of them using UTF-8 encoding. The "byte order mark" is not in either file. This encoding information is "remembered" somewhere!!! Someone out there must know where or how. – Harvey Jan 30 '22 at 12:06

2 Answers

10

It isn't, at least not by default. There's actually no difference between the way those two files containing abcd are stored in the filesystem, since the text string abcd is encoded identically in both encodings: it lies entirely within their shared ASCII subset.

Ext filesystems do not record file encoding metadata. While it is possible to store a limited amount of data (on the order of a few kilobytes) along with a file on an ext filesystem by using extended attributes, gedit apparently does not use this to store character encoding, and instead caches a specific user's selected encoding for specific files. You can demonstrate this by logging in as another user (I logged in as root for this experiment) and opening the same file: gedit will read it using the default encoding for the system locale, not the encoding you chose when saving it under the other login.

sig_seg_v
  • 570
  • 4
  • 15
  • 1
    Re: "In your second case, there's actually no difference between the encodings of the two files": Yes, that's the OP's point. (S)he intentionally created two files with identical contents. – ruakh Mar 27 '16 at 20:34
  • 2
    @ruakh edited for clarity & to answer the question more directly – sig_seg_v Mar 27 '16 at 20:38
2

The files' encoding is not stored as an attribute of the files. Instead, programs must examine the files to see which encoding is most suitable. Test 1 is the interesting one, since the files are different:

  • starting with the assumption that the file is encoded in UTF-8, gedit will attempt to decode it as UTF-8
  • the ISO-8859-15 file contains byte sequences that are not valid UTF-8, so gedit will handle it as one of the ISO-8859-x variants
  • ISO-8859-15 differs from the others (such as ISO-8859-1) in its interpretation of the same bytes, and that interpretation is not part of the sample.
  • lacking more information, whatever gedit tells you about the shorter one will reflect your locale settings and gedit configuration, but is essentially only a guess.

With Test 2, both files use ASCII encoding (a subset of both UTF-8 and ISO-8859-15), so there is no additional information: gedit will again rely on your locale and its configuration to decide whether to treat the files as UTF-8.
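
The reasoning above can be sketched in a few lines of Python. This is not gedit's actual algorithm: `guess_encoding` and its `fallback` parameter are hypothetical names, the separate ASCII check is added only to make the Test 2 case visible, and the fallback stands in for whatever the locale or editor configuration would supply.

```python
def guess_encoding(data: bytes, fallback: str = "iso8859-15") -> str:
    """Guess a text encoding from content alone (illustrative only)."""
    # Pure ASCII is valid in both UTF-8 and ISO-8859-x, so the bytes
    # give no evidence either way (the Test 2 situation).
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Start from the assumption that the file is UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to a configured single-byte encoding.
        # Which ISO-8859-x variant is meant cannot be derived from the bytes.
        return fallback


print(guess_encoding("aàbï\n".encode("utf-8")))       # valid UTF-8
print(guess_encoding("aàbï\n".encode("iso8859-15")))  # invalid UTF-8, so the fallback
print(guess_encoding(b"abcd\n"))                      # ASCII only
```

Whatever the fallback returns is a guess: any single-byte encoding will decode those bytes without error, so the choice has to come from outside the file.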

Thomas Dickey
  • 51,086
  • 7
  • 70
  • 105