Consider the following code:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5}); // 'micro' sign UTF-8 representation
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0]+" "+(int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}

Observed output:

µ 181
? 65533

Why does the second call of readAndPrint() (the one using US-ASCII) succeed? I would expect it to throw an error, since the input is not a valid character in this encoding. Which part of the Java API or JLS mandates this behavior?

Grzegorz Oledzki

2 Answers

When an InputStreamReader is created from a charset name, its default behavior on finding undecodable bytes in the input stream is to replace them with the Unicode character U+FFFD REPLACEMENT CHARACTER.
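
As a small illustration, the replacement string is a property of the decoder and can be inspected directly. This is only a sketch: the class name `ReplacementInfo` and the helper `replacementOf` are made up for this example, and only `CharsetDecoder.replacement()` comes from the JDK.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original question or answer.
public class ReplacementInfo {

    // Returns the string a freshly created decoder would substitute
    // when its error action is REPLACE.
    static String replacementOf(String encoding) {
        return Charset.forName(encoding).newDecoder().replacement();
    }

    public static void main(String[] argv) {
        String repl = replacementOf("US-ASCII");
        // Print the code point of the replacement character in hex.
        System.out.println("U+" + Integer.toHexString(repl.charAt(0)).toUpperCase());
    }
}
```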

If you want to change that, you can pass the InputStreamReader a CharsetDecoder that has a different CodingErrorAction configured:

CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);
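
For completeness, here is the snippet above folded into a runnable variant of the question's test. Treat it as a sketch: the class name `StrictReadingTest` and the method `readStrict` are invented for this example, and the extra `onUnmappableCharacter` call is an addition that the snippet above does not make.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;

// Hypothetical class name; mirrors ReadingTest from the question.
public class StrictReadingTest {

    // Decodes the bytes 0xC2 0xB5 and returns the first char.
    // With REPORT configured, invalid input raises an exception
    // instead of being replaced by U+FFFD.
    static char readStrict(String usingEncoding) throws IOException {
        ByteArrayInputStream bais =
                new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5});
        CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT); // assumption: also wanted
        InputStreamReader isr = new InputStreamReader(bais, decoder);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        return cbuf[0];
    }

    public static void main(String[] argv) throws IOException {
        System.out.println((int) readStrict("UTF-8")); // valid UTF-8, prints 181
        try {
            readStrict("US-ASCII");
            System.out.println("no error reported");
        } catch (MalformedInputException e) {
            // 0xC2 is not a valid US-ASCII byte, so the decoder reports it.
            System.out.println("malformed input rejected");
        }
    }
}
```

MalformedInputException is a subclass of IOException, so code that already handles IOException from the reader will see the failure without further changes.
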
Joachim Sauer
  • Thanks for your answer. The problem is I can't easily change the code which creates the `InputStreamReader`, because it's not mine - `org.apache.tools.ant.taskdefs.SQLExec.Transaction.runTransaction(PrintStream)`. I was surprised to learn the `encoding` attribute of Ant's `<sql>` task doesn't prevent malformed input. – Grzegorz Oledzki Feb 03 '11 at 14:06
  • @Grzegorz: I'd consider that a bug. At least when `encoding` is specified, the task should enforce the encoding, including reporting errors. If it's unspecified, then it's probably better to be error-tolerant. Maybe adding a `strictEncoding` attribute or something like that would be appropriate. – Joachim Sauer Feb 03 '11 at 14:09
  • I've filed an issue in Ant's bug database: https://issues.apache.org/bugzilla/show_bug.cgi?id=50715 but I don't expect the issue to be resolved soon. – Grzegorz Oledzki Feb 03 '11 at 14:28

I'd say this is the same behavior as documented for the constructor String(byte bytes[], int offset, int length, Charset charset):

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.

Using CharsetDecoder you can specify a different CodingErrorAction.
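
A minimal sketch of the difference, assuming Java 7 or later for `StandardCharsets`. The class name `DecodeComparison` and both helper methods are hypothetical names for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for illustration.
public class DecodeComparison {

    // Lenient path: the String constructor silently substitutes
    // the replacement string for bytes it cannot decode.
    static int lenientFirstChar(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.US_ASCII).charAt(0);
    }

    // Strict path: a CharsetDecoder configured with REPORT throws
    // a CharacterCodingException on the same input.
    static boolean strictDecodeFails(byte[] bytes) {
        try {
            StandardCharsets.US_ASCII.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] argv) {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // UTF-8 bytes of the micro sign
        System.out.println(lenientFirstChar(micro));  // 65533, i.e. U+FFFD
        System.out.println(strictDecodeFails(micro)); // true
    }
}
```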

maaartinus