Why does US-ASCII encoding accept non US-ASCII characters?

Question

Consider the following code:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5}); // 'micro' sign UTF-8 representation
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        System.out.println(cbuf[0]+" "+(int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}

Observed output:

µ 181
? 65533

Why does the second call of readAndPrint() (the one using US-ASCII) succeed? I would expect it to throw an error, since the input is not a proper character in this encoding. What is the place in the Java API or JLS which mandates this behavior?

score 9 · Accepted Answer · answered Feb 03 '11 at 13:08

The default operation when finding un-decodable bytes in the input-stream is to replace them with the Unicode Character U+FFFD REPLACEMENT CHARACTER.

If you want to change that, you can pass the InputStreamReader a CharsetDecoder that has a different CodingErrorAction configured:

CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);
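
For reference, here is a self-contained sketch of the same idea applied to the bytes from the question. The class name `StrictReadingTest` and the helper `readStrict` are made up for illustration, and setting `onUnmappableCharacter` is an extra precaution that the snippet above does not include.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictReadingTest {

    // Hypothetical helper: reads every character from the bytes and fails
    // instead of substituting U+FFFD when the charset cannot decode them.
    static String readStrict(byte[] bytes, String charsetName) throws IOException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // UTF-8 encoding of the micro sign

        // Valid UTF-8, so this decodes to a single char with code 181.
        System.out.println((int) readStrict(micro, "UTF-8").charAt(0));

        try {
            readStrict(micro, "US-ASCII");
            System.out.println("US-ASCII accepted the input");
        } catch (CharacterCodingException e) {
            // Bytes above 0x7F are not valid US-ASCII, so the decoder reports them.
            System.out.println("US-ASCII rejected the input: " + e);
        }
    }
}
```

With the decoder configured this way, the US-ASCII call should fail with a `CharacterCodingException` (a subclass of `IOException`) rather than silently returning the replacement character.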

Joachim Sauer

Thanks for your answer. The problem is I can't easily change the code which creates the `InputStreamReader`, because it's not mine - `org.apache.tools.ant.taskdefs.SQLExec.Transaction.runTransaction(PrintStream)`. I was surprised to learn the `encoding` attribute of Ant's `<sql>` task doesn't prevent malformed input. – Grzegorz Oledzki Feb 03 '11 at 14:06
@Grzegorz: I'd consider that a bug. At least when `encoding` is specified, the task should enforce the encoding, including reporting errors. If it's unspecified, then it's probably better to be error-tolerant. Maybe adding a `strictEncoding` attribute or something like that would be appropriate. – Joachim Sauer Feb 03 '11 at 14:09
I've filed an issue in Ant's bug database: https://issues.apache.org/bugzilla/show_bug.cgi?id=50715 but I don't expect it to be resolved soon. – Grzegorz Oledzki Feb 03 '11 at 14:28

score 3 · Answer 2 · answered Feb 03 '11 at 13:09

I'd say this is the same behavior as documented for the constructor String(byte[] bytes, int offset, int length, Charset charset):

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.

Using CharsetDecoder you can specify a different CodingErrorAction.
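
As a rough illustration of the difference, the sketch below decodes the same two bytes from the question once with the `String` constructor and once with a `CharsetDecoder` set to report errors. The class name `DecodeComparison` and both helper methods are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeComparison {

    // Lenient: the String constructor replaces bytes it cannot decode.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Strict: the decoder throws instead of replacing undecodable bytes.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // UTF-8 encoding of the micro sign

        // Print the code of the first decoded char; 65533 is U+FFFD.
        System.out.println((int) decodeLenient(micro).charAt(0));

        try {
            System.out.println(decodeStrict(micro));
        } catch (CharacterCodingException e) {
            System.out.println("Strict decoding failed: " + e);
        }
    }
}
```

Note that a decoder obtained from `newDecoder()` is documented to report errors by default, so the two `on...` calls here mainly make the intent explicit.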
