When I create a file with UTF-8 encoding in Java and open it in Notepad or Notepad++ afterwards, it says the file is ANSI encoded. How come?

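import java.io.*;
import java.nio.charset.StandardCharsets;

// Backslashes must be escaped in Java string literals ("\t" would be a tab)
File file = new File("path\\to\\file");
file.createNewFile();
// try-with-resources flushes and closes the writer, even if write() throws
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
    writer.write("something");
}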

If I write some special characters like Æ, Ø or Å to the file, then Notepad says it is UTF-8 encoded. Why is this?

Are the ANSI and UTF-8 byte representations the same if no special characters are included?

sjallamander
  • Since you use Java 7+, you should use [java.nio.file](http://java7fs.wikia.com/wiki/Using_the_java.nio.file_API) instead of `File` – fge Jan 15 '15 at 08:03

1 Answer

UTF-8 and ANSI use identical byte encodings for the first 128 characters [1]. So if you do not use any other characters, there is no way to tell the two apart from the bytes alone.
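A minimal sketch of that overlap, under one loud assumption: ISO-8859-1 stands in for "ANSI" here, because the actual ANSI code page depends on the Windows locale. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingComparison {
    // Returns true when the text encodes to the same bytes in UTF-8 and in
    // ISO-8859-1 (used here as an assumed stand-in for an "ANSI" code page).
    public static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    public static void main(String[] args) {
        // Pure ASCII text: both encodings produce the same bytes.
        System.out.println(sameBytes("something")); // prints "true"
        // U+00C6 (Æ) is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(sameBytes("\u00C6"));    // prints "false"
    }
}
```

With only ASCII content, an editor has no byte-level evidence either way, which is consistent with Notepad reporting ANSI for the file in the question.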

For such a file, the only way to mark it as UTF-8 is to add a byte order mark (BOM), a specially crafted byte sequence at the start of the file that marks its encoding:

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF.
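If a BOM is really wanted, one possible approach in Java is to write the code point U+FEFF as the first character, since the UTF-8 encoder does not add one on its own. This is only a sketch: the class and method names are hypothetical, and it writes to an in-memory stream rather than to the file from the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8BomExample {
    // Encodes the text as UTF-8, preceded by U+FEFF, which UTF-8 encodes
    // as the three bytes 0xEF, 0xBB, 0xBF.
    public static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // the BOM must be the very first character
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("something");
        // Mask with 0xFF so the bytes print as unsigned hex values.
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // prints "EF BB BF"
    }
}
```

The same two `write` calls would work with the `FileOutputStream` from the question. Whether adding a BOM is a good idea is a separate matter, since many tools do not expect one in UTF-8 files.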

[1] The Unicode characters U+0000..U+007F, each of which is encoded as a single byte with the highest bit set to 0 in both UTF-8 and ASCII.

Rob Audenaerde
  • Uhm, no, the BOM is not the only way; just use UTF-8 all the time and that's it – fge Jan 15 '15 at 08:02
  • @fge If you only use the 7-bit characters (ASCII), ANSI and UTF-8 are exactly the same. – Rob Audenaerde Jan 15 '15 at 08:04
  • As to the BOM it is also a Unicode code point; [U+FEFF](http://www.fileformat.info/info/unicode/char/feff/index.htm) to be precise. – fge Jan 15 '15 at 08:05
  • Yes I know; which is why you might just as well do the sane thing and use UTF-8 all the time. – fge Jan 15 '15 at 08:05
  • Yes, we don't want the BOM. So ANSI and UTF-8 are encoded the same way if the file contains no special characters. That's why Notepad just guesses it is ANSI. – sjallamander Jan 15 '15 at 08:09
  • @sjallamander exactly; now, Notepad{,++} should probably take a plunge into the 21st century and assume UTF-8 by default ;) – fge Jan 15 '15 at 08:12
  • Technically UTF-8 neither uses nor needs a BOM; using one with UTF-8 is allowed (but afaik discouraged) for compatibility reasons only. A BOM is needed with UTF-16 and UTF-32 to distinguish LE (little endian) from BE (big endian), but UTF-8 doesn't have endianness problems as the byte order is always the same. – Mark Rotteveel Jan 15 '15 at 09:29