Why does Java's CharsetEncoder define .onMalformedInput()/CharsetDecoder define .onUnmappableCharacter()?

Question

A CharsetDecoder decodes a sequence of bytes into a sequence of chars (see Charset#newDecoder()). Conversely, a CharsetEncoder (see Charset#newEncoder()) does the reverse: it takes a sequence of chars and encodes them into a sequence of bytes.

CharsetDecoder defines .onMalformedInput(), which seems logical (some byte sequences do not translate to a valid char sequence); but why does it define .onUnmappableCharacter(), since its input is a byte sequence?

Similarly, CharsetEncoder defines .onUnmappableCharacter() which is, here again, logical (for instance, if your charset is ASCII, you cannot encode ö); but why does it also define .onMalformedInput() since its input is a character sequence?

This is all the more intriguing given that you cannot obtain an encoder from a decoder or vice versa, and the two classes do not seem to share a common ancestor...
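For reference, the "logical" encoder case mentioned above (ASCII cannot encode ö) can be reproduced with a minimal sketch like the one below. The class and method names (AsciiUnmappableSketch, isUnmappableInAscii) are made up for illustration; only the java.nio.charset calls are real API.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical helper class, not part of the original question
final class AsciiUnmappableSketch {
    // Returns true if encoding the text as US-ASCII reports an unmappable character
    static boolean isUnmappableInAscii(final String text) {
        final CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return false;
        } catch (UnmappableCharacterException e) {
            // The input is a legal char sequence, but the charset has no bytes for it
            return true;
        } catch (CharacterCodingException e) {
            // Some other coding error (for instance malformed input)
            return false;
        }
    }

    public static void main(final String... args) {
        System.out.println(isUnmappableInAscii("o"));
        System.out.println(isUnmappableInAscii("\u00f6")); // ö
    }
}
```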

EDIT 1

It is indeed possible to trigger .onMalformedInput() on a CharsetEncoder. You "just" have to provide an illegal char or char sequence. The program below relies on the fact that in UTF-16, a high surrogate must be followed by a low surrogate; here, a two-element char array is built from two high surrogates instead, and the program then attempts to encode it. NOTE that creating a String from such an ill-formed char sequence throws no exception at all:

Code:

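import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class LargeText
{
    public static void main(final String... args)
        throws CharacterCodingException
    {
        boolean found = false;
        char c = '.';

        // Find the first char value which is a high surrogate
        for (int i = 0; i < 65536; i++) {
            if (Character.isHighSurrogate((char) i)) {
                c = (char) i;
                found = true;
                break;
            }
        }
        if (!found)
            throw new IllegalStateException();

        System.out.println("found: " + Integer.toHexString(c));

        // Two high surrogates in a row: not a legal UTF-16 sequence
        final char[] foo = { c, c };

        new String(foo); // <-- DOES NOT THROW AN EXCEPTION!!!

        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT);

        // Throws MalformedInputException
        encoder.encode(CharBuffer.wrap(foo));
    }
}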

Output:

found: d800
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:798)
    at com.github.fge.largetext.LargeText.main(LargeText.java:166)

EDIT 2

But now, how about the reverse? From @Kairos's answer below, quoting the Javadoc:

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT

Now, what is "cannot be mapped to an equivalent character sequence"?

I work quite a bit with CharsetDecoders in this project and have yet to produce such an error. I know how to reproduce an error in which, for instance, you only have two bytes out of a three-byte UTF-8 sequence, but this triggers a MalformedInputException. All you have to do in this case is restart the decoding from the last known position of the ByteBuffer.
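A minimal sketch of that truncated-sequence case, assuming the input is the euro sign (U+20AC, bytes E2 82 AC in UTF-8). The class and method names are made up for illustration. The first method feeds only two of the three bytes to the one-shot decode(), which treats the buffer as the complete input; the second uses the three-argument decode() with endOfInput set to false, so the incomplete tail stays unconsumed in the buffer and decoding can resume once the missing byte arrives.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question
final class TruncatedUtf8Sketch {
    // U+20AC (euro sign) encoded in UTF-8
    static final byte[] EURO = { (byte) 0xE2, (byte) 0x82, (byte) 0xAC };

    // One-shot decode of the first two bytes only: the decoder sees the end
    // of input in the middle of a sequence and reports malformed input
    static boolean truncatedInputIsMalformed() {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(EURO, 0, 2));
            return false;
        } catch (MalformedInputException e) {
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Incremental decode: the first call is told more input may follow
    static String decodeInTwoSteps() {
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        final ByteBuffer in = ByteBuffer.allocate(8);
        final CharBuffer out = CharBuffer.allocate(8);

        in.put(EURO, 0, 2);
        in.flip();
        decoder.decode(in, out, false); // leaves the two bytes unconsumed

        in.compact();                   // keep the unconsumed bytes, back to write mode
        in.put(EURO[2]);
        in.flip();
        decoder.decode(in, out, true);  // the sequence is now complete
        decoder.flush(out);

        out.flip();
        return out.toString();
    }

    public static void main(final String... args) {
        System.out.println(truncatedInputIsMalformed());
        System.out.println(decodeInTwoSteps());
    }
}
```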

Triggering an UnmappableCharacterException would basically mean that the character encoding itself allows an illegal char, or an illegal Unicode code point, to be generated.

Is this possible at all?
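One way to investigate this empirically is to feed every possible single byte to a decoder and check whether the CoderResult is ever flagged as unmappable rather than malformed. The sketch below does only that; it is an exploration aid, not an answer. The class and method names are made up, the charset names tried in main() are guesses that may not be installed on a given JVM, and a probe limited to single bytes says nothing about multi-byte sequences. Whether any charset reports an unmappable byte depends on the mapping tables of the JDK at hand.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.OptionalInt;

// Hypothetical probe, not part of the original question
final class UnmappableProbe {
    // Returns the first single byte (0..255) for which the decoder reports an
    // unmappable-character error, or an empty result if there is none
    static OptionalInt firstUnmappableByte(final Charset charset) {
        final CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        final CharBuffer out = CharBuffer.allocate(16);

        for (int b = 0; b < 256; b++) {
            decoder.reset();
            out.clear();
            final ByteBuffer in = ByteBuffer.wrap(new byte[] { (byte) b });
            final CoderResult result = decoder.decode(in, out, true);
            if (result.isUnmappable())
                return OptionalInt.of(b);
        }
        return OptionalInt.empty();
    }

    public static void main(final String... args) {
        // Assumed candidates; any of these may be missing on a given JVM
        final String[] names = { "ISO-8859-1", "US-ASCII", "windows-1252", "Shift_JIS" };

        for (final String name : names) {
            if (!Charset.isSupported(name))
                continue;
            final OptionalInt found = firstUnmappableByte(Charset.forName(name));
            System.out.println(name + ": " + (found.isPresent()
                ? "byte 0x" + Integer.toHexString(found.getAsInt()) + " is unmappable"
                : "no unmappable single byte found"));
        }
    }
}
```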

You bring up a good point. There are few examples even using the method, much less _why_ it is implemented...still looking though. — Reuben Tanner, Apr 05 '14 at 21:10
@Kairos a real world example can be found in Java 7's `UnixPath` (abstract `Path` implementation of JDK 7 for Unix systems): it uses a `CharsetEncoder` with `.onUnmappableCharacter(CodingErrorAction.REPORT)`; if the `String` you provide cannot be encoded according to your `Charset.defaultCharset()`, `Paths.get()` will throw an `InvalidPathException` if the encoding detected a character (or sequence of) not mappable to a byte sequence — fge, Apr 05 '14 at 21:32

Reuben Tanner · Accepted Answer · 2014-04-05T21:36:41.173


Per the docs for CharsetEncoder.encode(), it throws a MalformedInputException:

If the character sequence starting at the input buffer's current position is not a legal sixteen-bit Unicode sequence and the current malformed-input action is CodingErrorAction.REPORT

So, you are given the option of providing a CodingErrorAction by utilizing onMalformedInput so that if you encounter one of these illegal sixteen-bit Unicode sequences, the provided action will be performed.
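As a rough illustration of what the choice of action changes, the sketch below encodes a lone high surrogate followed by 'a' to UTF-8 under each CodingErrorAction. The class and method names are made up; the helper turns a reported error into a string only to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer
final class MalformedActionSketch {
    // Encodes the input to UTF-8 with the given malformed-input action, then
    // decodes the resulting bytes again so that the outcome is easy to print
    static String encodeWith(final CodingErrorAction action, final char... input) {
        final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
            .onMalformedInput(action);
        try {
            final ByteBuffer bytes = encoder.encode(CharBuffer.wrap(input));
            return StandardCharsets.UTF_8.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Only reached when the action is REPORT
            return "reported: " + e.getClass().getSimpleName();
        }
    }

    public static void main(final String... args) {
        final char[] input = { '\ud800', 'a' }; // lone high surrogate, then 'a'

        System.out.println(encodeWith(CodingErrorAction.REPORT, input));
        System.out.println(encodeWith(CodingErrorAction.REPLACE, input));
        System.out.println(encodeWith(CodingErrorAction.IGNORE, input));
    }
}
```

With REPORT the encoder fails with a MalformedInputException; with REPLACE the offending char is substituted by the encoder's replacement bytes; with IGNORE it is silently dropped.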

Similarly for CharsetDecoder.decode():

UnmappableCharacterException - If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT

edited Apr 05 '14 at 21:36

answered Apr 05 '14 at 21:16

Reuben Tanner


An example? Not sure, maybe a string starting with a combining diacritical mark, which normally comes _after_ the base letter: `c` + combining-`^` (`"\u0302"`) = `ĉ` – Joop Eggen Apr 05 '14 at 21:27
Uh, OK, I can see that `.onMalformedInput()` can relate to `char`s, for instance, if you have two `char`s following one another which are an invalid (BE) UTF-16 sequence, for instance, two high surrogates following one another. +1 for that. But how about the reverse? How can you trigger `.onUnmappableCharacter()` from a `CharsetDecoder`? – fge Apr 05 '14 at 21:30
@fge You yourself have said it already "if your charset is ASCII, you cannot encode ö". This would be a fitting example during CharsetDecoder.decode() as well. – Reuben Tanner Apr 05 '14 at 21:37
@Kairos that would be a `CharsetEncoder` which is used here, not a `CharsetDecoder`; and in this case the triggered action would have been `.onUnmappableCharacter()` -- but on an encoder, not a decoder – fge Apr 05 '14 at 21:39
@fge, ah I see. Your above comment asks for it in CharsetDecoder. – Reuben Tanner Apr 05 '14 at 21:42
@Kairos well yes -- how can you have an unmappable _character_ while decoding a _byte_ array? After all, a decoder tries to turn a byte sequence into a legal char sequence – fge Apr 05 '14 at 21:42
@fge, it's not quite a byte _array_ exactly, decode takes a ByteBuffer which can be created from chars with ByteBuffer.putChar(). – Reuben Tanner Apr 05 '14 at 21:45
Oh, also -- the second extract says "If the byte sequence starting at the input buffer's current position cannot be mapped to an equivalent character sequence and the current unmappable-character action is CodingErrorAction.REPORT" <-- OK, but how does it distinguish a short read from an actual unmappable sequence? Is there such a character encoding? – fge Apr 05 '14 at 21:45
@Kairos err no, doesn't quite work that way ;) A `char` is not two `byte`s! – fge Apr 05 '14 at 21:45
@Kairos just confirmed in the question edit that it is possible to trigger `.onMalformedInput()` on a `CharsetEncoder`, pity I can't +1 more – fge Apr 05 '14 at 22:12
@Kairos I'd love to, except that I want to know how to trigger the reverse too! – fge Apr 05 '14 at 22:21
@fge, and that was to get CharsetDecoder to throw an UnmappableCharacterException? – Reuben Tanner Apr 05 '14 at 22:25
@Kairos yup... See question edit. I have no clue how to trigger that at the moment – fge Apr 05 '14 at 22:30
@Kairos I accepted your answer since it answers half of my problem... The other half I have already asked in another question [here](http://stackoverflow.com/q/22022145/1093528) – fge Apr 05 '14 at 23:10
@fge, thanks! I have an idea for your second part and am attempting it now. – Reuben Tanner Apr 05 '14 at 23:19
@Kairos if you can answer it, the question linked above will grant +500 for what you find it worth :p I really wish to have an explanative answer – fge Apr 05 '14 at 23:29
@fge, you've doomed me to no longer get anything done until i solve this problem :-\. now, why'd you go and do that :-P – Reuben Tanner Apr 05 '14 at 23:32
@Kairos don't feel compelled to answer either, heh :p – fge Apr 05 '14 at 23:52
