
A question on reading text files in Java. I have a text file saved with UTF-8 encoding with only the content:

Hello. World.

Now I am using a RandomAccessFile to read this file. But for some reason, there seems to be an "invisible" character at the beginning of the file ...?

I use this code:

File file = new File("resources/texts/books/testfile2.txt");
try(RandomAccessFile reader = new RandomAccessFile(file, "r")) {

    String readLine = reader.readLine();
    String utf8Line = new String(readLine.getBytes("ISO-8859-1"), "UTF-8" );
    System.out.println("Read Line: " + readLine);
    System.out.println("Real length: " + readLine.length());
    System.out.println("UTF-8 Line: " + utf8Line);
    System.out.println("UTF-8 length: " + utf8Line.length());
    System.out.println("Current position: " + reader.getFilePointer());
} catch (Exception e) {
    e.printStackTrace();
}

The output is this:

Read Line: ?»?Hello. World.
Real length: 16
UTF-8 Line: ?Hello. World.
UTF-8 length: 14
Current position: 16

These (1 or 2) characters seem to appear only at the very beginning. If I add more lines to the file and read them, then all the further lines are being read normally. Can someone explain this behavior? What is this character at the beginning?

Thanks!

DanielBK

1 Answer


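The first 3 bytes in your file (0xef, 0xbb, 0xbf) are the so-called UTF-8 BOM (Byte Order Mark). A BOM is important for UTF-16 and UTF-32 only - for UTF-8 it has no meaning. Microsoft introduced it to make guessing the file encoding easier. Because `RandomAccessFile.readLine()` turns every byte into one character, you see three extra characters at first (length 16); after re-decoding as UTF-8, those three bytes become the single character U+FEFF (length 14).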

That is, not all UTF-8 encoded text files have that mark, but some do.
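If you want to keep using `RandomAccessFile`, one possible approach is to check for those three bytes yourself and move the file pointer past them before reading. The following is only a minimal sketch: the class name `BomSkipper` and the helpers `skipUtf8Bom` and `readUtf8Line` are made up for illustration, and it writes its own temporary test file rather than using your `testfile2.txt`.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomSkipper {

    // UTF-8 encoding of U+FEFF, i.e. the "UTF-8 BOM".
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Positions the file pointer after a leading UTF-8 BOM, or at 0 if there is none. */
    public static void skipUtf8Bom(RandomAccessFile reader) throws IOException {
        reader.seek(0);
        if (reader.length() >= UTF8_BOM.length) {
            byte[] head = new byte[UTF8_BOM.length];
            reader.readFully(head);
            if (Arrays.equals(head, UTF8_BOM)) {
                return; // pointer now sits right behind the BOM
            }
        }
        reader.seek(0); // no BOM: rewind so no data is skipped
    }

    /** Reads one line and re-decodes it as UTF-8, as in the question's code. */
    public static String readUtf8Line(RandomAccessFile reader) throws IOException {
        String raw = reader.readLine(); // one char per byte, high bits zero
        if (raw == null) {
            return null; // end of file
        }
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical test file: BOM followed by the text from the question.
        File file = File.createTempFile("bom-demo", ".txt");
        file.deleteOnExit();
        try (RandomAccessFile writer = new RandomAccessFile(file, "rw")) {
            writer.write(UTF8_BOM);
            writer.write("Hello. World.\n".getBytes(StandardCharsets.UTF_8));
        }

        try (RandomAccessFile reader = new RandomAccessFile(file, "r")) {
            skipUtf8Bom(reader);
            String line = readUtf8Line(reader);
            System.out.println("UTF-8 Line: " + line);            // UTF-8 Line: Hello. World.
            System.out.println("UTF-8 length: " + line.length()); // UTF-8 length: 13
        }
    }
}
```

Checking the raw bytes up front keeps the rest of the reading code unchanged, and files without a BOM are read from position 0 as before.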

MarianD
  • 1
    "*for UTF-8 it has no meaning*" - that is not true. It has the same meaning as a UTF-16 BOM or a UTF-32 BOM - to specify the file encoding. It is just that it only works in BOM-aware apps. For *backwards compatibility* with legacy apps, putting a BOM in a UTF-8 text file is not recommended for files that are to be used with legacy apps that won't know how to handle the BOM. – Remy Lebeau Aug 30 '18 at 23:55
  • @RemyLebeau, you aren't right, I'm sorry. BOM - as its name suggests - is about the order of bytes in words / double words ("endianness"). Its use for guessing the codec *has nothing to do with its original intention*, and provides *nothing more than a guess*. It is the 2-byte sequence `0xfeff` - the Unicode ZERO WIDTH NO-BREAK SPACE symbol, U+FEFF. In contrast, the 3-byte sequence `0xefbbbf` has nothing to do with the order of bytes (as in UTF-8 the order of bytes is unambiguous), so its name BOM is silly. – MarianD Aug 31 '18 at 00:58
  • Per [unicode.org](http://unicode.org/faq/utf_bom.html): "*A byte order mark (BOM) ... **can be used as a signature** defining the byte order **and encoding form**... **A BOM can be used as a signature no matter how the Unicode text is transformed**: UTF-16, UTF-8, or UTF-32... the BOM serves to indicate both that it is a Unicode file, and **which of the formats it is in**... UTF-8 can contain a BOM... UTF-8 always has the same byte order. An initial BOM is **only used as a signature** - an indication that an otherwise unmarked text file is in UTF-8.*" – Remy Lebeau Aug 31 '18 at 03:53
  • @RemyLebeau, thanks, your quotes support what I wrote. – MarianD Aug 31 '18 at 17:18
  • 1
    My point was, despite its name, a BOM is *more* than just for indicating byte ordering. It is *ALSO* used as an encoding indicator, which applies to UTF-8 (and other single-byte UTFs, such as UTF-7), not just to UTF-16/32. The Unicode Consortium confirms and documents it as such. – Remy Lebeau Aug 31 '18 at 17:21
  • @RemyLebeau, yes, I agree with you. – MarianD Aug 31 '18 at 17:48