encoding latin1 to UTF8 fails

Question

We have a test-file (csv) for imports that is encoded as latin1 (as vim reports).

We have changed file.encoding and client.file.encoding in WebSphere to UTF-8.

Now the same file is rejected with "MalformedInputException" in sun.io.ByteToCharUTF8.convert

Why?

I assumed that UTF-8 is a superset of latin1. So perhaps some bytes might be misinterpreted, but there shouldn't be an exception, since we are only broadening the charset?

What else could be the cause of this "MalformedInputException"?

The set of characters representable by UTF-8 is a superset of the set representable by Latin-1. However, the UTF-8 **encoding** is incompatible with the Latin-1 encoding: they match for characters with codes < 128, but Latin-1 characters >= 128 are represented differently in UTF-8, and their single-byte Latin-1 codes are not well-formed UTF-8 byte sequences. – atzz, Sep 10 '12 at 10:33

score 2 · Accepted Answer · answered Sep 10 '12 at 10:33

As an encoding, UTF-8 is a superset of ASCII, but not of Latin-1 (which is a different superset of ASCII). All characters in the range 0-127 are encoded identically in UTF-8 and ASCII, but Latin-1 also defines many characters in the range 128-255 as single bytes, and those bytes generally do not form valid sequences when interpreted as UTF-8, so a UTF-8 decoder may reject them.

Aasmund Eldhuset
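
To make the byte-level difference concrete, here is a minimal Java sketch (not from the original answer; the class name `EncodingBytes` and the helper `hex` are made up for illustration). It encodes an ASCII letter and an accented letter in both charsets and prints the resulting bytes.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Render a byte array as space-separated upper-case hex, e.g. "C3 A9".
    // (Hypothetical helper, only for this illustration.)
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ascii = "A";
        String accented = "\u00E9"; // é, U+00E9

        // ASCII characters are encoded identically in both charsets.
        System.out.println(hex(ascii.getBytes(StandardCharsets.ISO_8859_1)));    // 41
        System.out.println(hex(ascii.getBytes(StandardCharsets.UTF_8)));         // 41

        // Above 127 the encodings diverge: one byte in Latin-1, two in UTF-8.
        System.out.println(hex(accented.getBytes(StandardCharsets.ISO_8859_1))); // E9
        System.out.println(hex(accented.getBytes(StandardCharsets.UTF_8)));      // C3 A9
    }
}
```

The lone byte `E9` that Latin-1 produces is the start of a three-byte sequence in UTF-8, so what follows it in a Latin-1 file will usually not fit the pattern a UTF-8 decoder expects.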

But wouldn't misinterpreting the bytes just display them as different characters later on, rather than throw an exception? – Bastl Sep 10 '12 at 11:14

@Bastl: In UTF-8, bytes where the most significant bit is 1 (that is, bytes in the range 128-255) indicate a character that is represented by multiple bytes, and there are certain [rules as to the structure of those bytes](http://en.wikipedia.org/wiki/Utf-8#Description). Random Latin-1 characters will likely violate those rules and be _invalid_ (as opposed to representing a valid, but different character). Anyway: why would you want characters in your document to be incorrectly interpreted? Is there anything that prevents you from reading the file as Latin-1? – Aasmund Eldhuset Sep 10 '12 at 11:25
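
A rough Java sketch of both points, assuming a made-up CSV line `café;1` stored as Latin-1 (the class `Latin1ToUtf8` and its methods are illustrative names, not anything from WebSphere or the original poster's code). Decoding the bytes strictly as UTF-8 fails, while decoding them with the charset they were written in works and lets you re-encode to UTF-8 if that is what the import needs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

class Latin1ToUtf8 {
    // Decode bytes with the given charset, failing on bad input instead of substituting.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Re-encode Latin-1 bytes as UTF-8 by first decoding them as what they really are.
    static byte[] latin1ToUtf8(byte[] latin1) {
        return new String(latin1, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws CharacterCodingException {
        // Hypothetical CSV line "café;1" in Latin-1: é is the single byte 0xE9.
        byte[] latin1Line = {'c', 'a', 'f', (byte) 0xE9, ';', '1'};

        try {
            decodeStrict(latin1Line, StandardCharsets.UTF_8);
            System.out.println("unexpectedly decoded as UTF-8");
        } catch (MalformedInputException e) {
            // 0xE9 announces a three-byte sequence, but ';' is not a continuation byte.
            System.out.println("rejected as UTF-8: " + e.getMessage());
        }

        byte[] utf8Line = latin1ToUtf8(latin1Line);
        String fromUtf8 = decodeStrict(utf8Line, StandardCharsets.UTF_8);
        String fromLatin1 = decodeStrict(latin1Line, StandardCharsets.ISO_8859_1);
        System.out.println(fromUtf8.equals(fromLatin1)); // true
    }
}
```

The design point is that the charset passed to the decoder has to describe the bytes actually in the file; changing a JVM default does not change the file.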

@Bastl: I know the feeling. :-) You might want to read [Joel Spolsky's excellent article on character sets and encodings](http://www.joelonsoftware.com/articles/Unicode.html), by the way. – Aasmund Eldhuset Sep 10 '12 at 12:16

@Bastl: Also, from [RFC 3629](http://tools.ietf.org/html/rfc3629): "Implementations of the decoding algorithm above MUST protect against decoding invalid sequences." So a conforming UTF-8 decoder must not silently accept an invalid byte sequence; a strict one, like the converter in your stack trace, reports it as an exception. – Aasmund Eldhuset Sep 10 '12 at 13:29
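
In Java's `java.nio.charset` API, how an invalid sequence surfaces depends on the decoder's configured error action. The sketch below (illustrative class and method names, not taken from the thread) contrasts the convenience method `Charset.decode`, which substitutes the replacement character U+FFFD, with a freshly created `CharsetDecoder`, which reports malformed input by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class ReportVsReplace {
    // Lenient: Charset.decode replaces each malformed sequence with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Strict: a new CharsetDecoder reports malformed input unless told otherwise.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // 0xE9 is é in Latin-1 but not a valid UTF-8 sequence in this position.
        byte[] invalid = {'a', (byte) 0xE9, 'b'};

        String lenient = decodeLenient(invalid);
        System.out.println(lenient.indexOf('\uFFFD') >= 0); // true

        try {
            decodeStrict(invalid);
            System.out.println("unexpectedly accepted");
        } catch (CharacterCodingException e) {
            System.out.println("strict decoder rejected the input: " + e);
        }
    }
}
```

Either way the invalid bytes are never passed through as if they were valid characters; the only choice is between a visible error and a visible placeholder.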
