Why does Java StandardCharsets provide three "UTF-16" encoding types but Notepad only provides 2 options (BE & LE)?

Question

StandardCharsets provides three entries for UTF-16:

    /**
     * Sixteen-bit UCS Transformation Format, big-endian byte order
     */
    public static final Charset UTF_16BE = new sun.nio.cs.UTF_16BE();
    /**
     * Sixteen-bit UCS Transformation Format, little-endian byte order
     */
    public static final Charset UTF_16LE = new sun.nio.cs.UTF_16LE();
    /**
     * Sixteen-bit UCS Transformation Format, byte order identified by an
     * optional byte-order mark
     */
    public static final Charset UTF_16 = new sun.nio.cs.UTF_16();

Notepad (and Notepad++) offer only two UTF-16 options: big-endian (BE) and little-endian (LE).

Why is UTF-16 missing in Notepad? (Are UTF-16 and UTF-16 BE the same thing?)

I don't use notepad or notepad++, but are those options for decoding/reading/opening the text file, or is it for encoding/writing/saving the file? — Sweeper, Feb 06 '23 at 07:13
In Notepad, the options appear when saving; in Notepad++, the encoding options can be checked while the file is displayed. — fatherazrael, Feb 06 '23 at 07:21

score 1 · Answer 1 · answered Feb 06 '23 at 07:36

These are documented in the Javadoc for java.nio.charset.Charset:

When decoding, the UTF-16BE and UTF-16LE charsets interpret the initial byte-order marks as a ZERO-WIDTH NON-BREAKING SPACE; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte-order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
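The encoding half of that documentation can be checked by encoding a single character and printing the resulting bytes. This is only a minimal sketch: the class name `Utf16EncodeDemo` and the helper `hex` are made up for illustration, while `String.getBytes(Charset)` and the `StandardCharsets` constants are the standard JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the JDK or of the question.
class Utf16EncodeDemo {
    // Encode the text with the given charset and render the bytes as hex.
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16 writes a big-endian BOM (FE FF) before the data.
        System.out.println("UTF_16   : " + hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
        // UTF_16BE and UTF_16LE write no BOM at all.
        System.out.println("UTF_16BE : " + hex("A", StandardCharsets.UTF_16BE)); // 00 41
        System.out.println("UTF_16LE : " + hex("A", StandardCharsets.UTF_16LE)); // 41 00
    }
}
```

Only the `UTF_16` line starts with the two extra BOM bytes; the other two differ from each other only in byte order.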

So in short, UTF_16BE and UTF_16LE neither write a BOM nor treat one as a byte-order signal, so they do not correspond to the "UTF-16 BE BOM" or "UTF-16 LE BOM" options in Notepad++, as you seem to imply.

On the other hand, UTF_16 does write a BE BOM when encoding, so it would correspond to choosing the "(Convert to) UTF-16 BE BOM" option in Notepad++. Note that for decoding, the BOM is optional: if it is absent, UTF_16 assumes big-endian.
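The decoding side can be sketched the same way, by feeding hand-built byte arrays to each charset and listing the decoded code units so that an otherwise invisible U+FEFF shows up. The class name `Utf16DecodeDemo`, the helper `codeUnits`, and the two sample arrays are illustrative assumptions, not anything from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the sample byte arrays are made up for illustration.
class Utf16DecodeDemo {
    // "A" in little-endian order, preceded by a little-endian BOM (FF FE).
    static final byte[] LE_WITH_BOM = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
    // "A" in big-endian order with no BOM.
    static final byte[] BE_NO_BOM = {0x00, 0x41};

    // Decode the bytes and list the resulting UTF-16 code units.
    static String codeUnits(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (char c : new String(bytes, cs).toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF_16 uses the BOM to pick the byte order and then drops it.
        System.out.println(codeUnits(LE_WITH_BOM, StandardCharsets.UTF_16));   // U+0041
        // Without a BOM, UTF_16 falls back to big-endian.
        System.out.println(codeUnits(BE_NO_BOM, StandardCharsets.UTF_16));     // U+0041
        // UTF_16LE keeps the BOM as an ordinary character (U+FEFF).
        System.out.println(codeUnits(LE_WITH_BOM, StandardCharsets.UTF_16LE)); // U+FEFF U+0041
    }
}
```

The last line is the practical catch: reading a BOM-prefixed file with UTF_16LE or UTF_16BE leaves a U+FEFF at the start of the string, which the caller has to strip.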

As for the Notepad options, their names do not say whether a BOM is included, so I'm not sure whether they write one. If they do not, they would be equivalent to the encoding behaviour of UTF_16BE and UTF_16LE.

As for why Notepad++ does not have equivalents of UTF_16BE and UTF_16LE, or why Java doesn't have a "UTF-16 LE BOM" option, that is not really a useful question to ask. As Eric Lippert said:

features are not magically implemented by default and then the implementations have to get removed by the development team for a good reason. Rather, all features are unimplemented by default and have to be thought of, designed, specified, implemented, tested, approved and shipped to customers. All that costs time and effort.
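Even without a dedicated StandardCharsets constant, the bytes that a "UTF-16 LE BOM" option would presumably produce can be assembled by hand: prepend U+FEFF to the text and encode with UTF_16LE. This is a sketch under that assumption, and the class name `Utf16LeWithBom` and its `encode` method are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the JDK.
class Utf16LeWithBom {
    // Prepend U+FEFF, then encode little-endian: the BOM comes out as FF FE.
    static byte[] encode(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A");
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // FF FE 41 00

        // UTF_16 reads the BOM, switches to little-endian, and drops it.
        System.out.println(new String(bytes, StandardCharsets.UTF_16)); // A
    }
}
```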

For anyone that wants to see the inclusion or omission of the BOM, here is a bit of example code: `Files.writeString( Paths.get( "/Users/whatever/bogus_UTF_8.txt" ) , "UTF_8" , StandardCharsets.UTF_8 ); Files.writeString( Paths.get( "/Users/whatever/bogus_UTF_16.txt" ) , "UTF_16" , StandardCharsets.UTF_16 ); Files.writeString( Paths.get( "/Users/whatever/bogus_UTF_16BE.txt" ) , "UTF_16BE" , StandardCharsets.UTF_16BE ); Files.writeString( Paths.get( "/Users/whatever/bogus_UTF_16LE.txt" ) , "UTF_16LE" , StandardCharsets.UTF_16LE );` — Basil Bourque, Feb 06 '23 at 08:10
