
I have a web service that receives an uploaded text file. On the server side I get an InputStream object, which I wrap in an InputStreamReader with "UTF-8" as the charset. But I noticed that uploading a file encoded in US-ASCII also works. It seems Java can automatically convert a file from any other charset to UTF-8. Am I right? How does the charset attribute work?

ROMANIA_engineer
Ensom Hodder

5 Answers


UTF-8 is a superset of US-ASCII

ASCII consists of 7-bit characters (0 to 127), and these are unchanged in US-ASCII, UTF-8 and many other character sets. Where most character sets differ is in the high-bit bytes (128 to 255). In US-ASCII these are undefined; in ISO-8859-1 they are single-byte characters, allowing values up to 255; in UTF-8, characters beyond 127 are encoded using 2 to 4 bytes, so it can represent code points up to U+10FFFF (over a million possible characters).
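A minimal sketch of the point above (the class name is just for illustration): encoding the same pure-ASCII string with US-ASCII and UTF-8 produces identical bytes, while a character above 127 encodes differently in ISO-8859-1 and UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8Demo {
    public static void main(String[] args) {
        String ascii = "Hello, world!";  // pure 7-bit ASCII text
        byte[] asAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8  = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asAscii, asUtf8));  // true: byte-for-byte identical

        String accented = "café";  // 'é' (U+00E9) is outside ASCII
        System.out.println(accented.getBytes(StandardCharsets.ISO_8859_1).length); // 4: one byte per char
        System.out.println(accented.getBytes(StandardCharsets.UTF_8).length);      // 5: 'é' takes two bytes
    }
}
```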

Peter Lawrey

No, Java does not usually automagically transform one character set into another, especially if you tell it explicitly which character set to use.

The thing is, however, that UTF-8 is ASCII-compatible. That means that every valid ASCII stream is automatically a valid UTF-8 stream as well, and that text containing only ASCII characters encoded in UTF-8 is also valid ASCII.

So if you plan to accept only ASCII and UTF-8 input, then treating it all as UTF-8 is perfectly valid. If you plan to support other encodings as well, then you'll need some way to transmit the information about the actual encoding being used as well.
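A sketch of that setup, assuming the server-side stream is read as text (the `ReadUpload` class and `readAsUtf8` helper are hypothetical names, not part of the question's code): wrapping the upload's InputStream in a UTF-8 InputStreamReader handles both ASCII and UTF-8 uploads unchanged.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUpload {
    // Read an upload as text, assuming the client sent ASCII or UTF-8.
    static String readAsUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] asciiUpload = "plain ASCII upload".getBytes(StandardCharsets.US_ASCII);
        // The ASCII bytes are already valid UTF-8, so the text round-trips unchanged.
        System.out.println(readAsUtf8(new ByteArrayInputStream(asciiUpload)));
    }
}
```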

Joachim Sauer

This only works because US-ASCII is a subset of UTF-8 (every ASCII file is also a valid UTF-8 file of the same data).

Try with something else, and it will break.
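For instance (a small sketch, not from the original answer): bytes produced by ISO-8859-1 are not valid UTF-8 once they go above 127, so a UTF-8 decoder substitutes the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    public static void main(String[] args) {
        // "café" encoded in ISO-8859-1: 'é' is the single byte 0xE9.
        byte[] latin1 = "café".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding those bytes as UTF-8 breaks: a lone 0xE9 is not a valid
        // UTF-8 sequence, so the decoder substitutes U+FFFD.
        String decoded = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("café"));  // false: the 'é' was lost
    }
}
```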

Thilo

UTF-8 is compatible with ASCII, i.e. every ASCII document is also valid UTF-8. Quoting Wikipedia:

[UTF-8] was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

[...] The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, making valid ASCII text valid UTF-8-encoded Unicode as well.

So Java still treats your stream as UTF-8. If you try to consume UTF-16 or UTF-32 with a UTF-8 reader, you'll get garbage.
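A quick sketch of that failure mode (class name is illustrative only): Java's `UTF_16` charset emits a byte-order mark plus two bytes per character, none of which a UTF-8 decoder can make sense of.

```java
import java.nio.charset.StandardCharsets;

public class Utf16AsUtf8 {
    public static void main(String[] args) {
        // "AB" in Java's UTF-16 (BOM + big-endian): FE FF 00 41 00 42
        byte[] utf16 = "AB".getBytes(StandardCharsets.UTF_16);

        // Decoding those bytes as UTF-8 yields garbage: the BOM bytes are
        // invalid UTF-8 (each becomes U+FFFD), and every 0x00 becomes a NUL.
        String garbage = new String(utf16, StandardCharsets.UTF_8);
        System.out.println(garbage.equals("AB"));  // false
        System.out.println(garbage.length());      // 6, not 2
    }
}
```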

Tomasz Nurkiewicz
  • In fact, I have a **catch** block for UnsupportedEncodingException, and I did try a UTF-16 encoded file. The strange thing is that it does not throw this exception as expected. – Ensom Hodder Aug 30 '12 at 09:11
  • @EnsomHodder: "*`UnsupportedEncodingException` - If the named charset is not supported*" - this exception will be thrown if you use encoding unsupported by the JVM/OS. Try `UTF-42` or `FOO-7` – Tomasz Nurkiewicz Aug 30 '12 at 09:24
  • Yes, I noticed that the exception is thrown only when the given charset name is not supported. The JVM itself does not try to detect the encoding scheme; it uses the given charset to parse the file. That's my understanding. – Ensom Hodder Aug 30 '12 at 09:40
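A small sketch of the behavior discussed in the comments above (class name is illustrative): UnsupportedEncodingException reacts only to the charset *name*, never to the data, so a mismatched but known name like "UTF-16" constructs a reader without complaint.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class CharsetNameDemo {
    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);

        // A known charset name never throws, even if it doesn't match the data:
        new InputStreamReader(new ByteArrayInputStream(data), "UTF-16");

        // Only an unrecognized name triggers UnsupportedEncodingException:
        try {
            new InputStreamReader(new ByteArrayInputStream(data), "UTF-42");
        } catch (UnsupportedEncodingException e) {
            System.out.println("unsupported: " + e.getMessage());
        }
    }
}
```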

Why? If you're uploading files, just use the InputStream. You don't want to mess about transforming the file data into UTF-16 and then back into a possibly different encoding again.

Just copy the bytes.
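A sketch of that approach (the `CopyBytes` class and `copy` helper are illustrative names): copying raw bytes involves no charset at all, so the upload can never be corrupted by a decoding step.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyBytes {
    // Copy the upload verbatim; no charset is involved, so nothing is altered.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // Arbitrary bytes that are not valid UTF-8 survive the copy untouched.
        byte[] upload = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        ByteArrayOutputStream dest = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(upload), dest);
        System.out.println(dest.size());  // 4: identical bytes out
    }
}
```

(On Java 9 and later, `in.transferTo(out)` does the same job in one call.)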

user207421
  • In fact, I have to parse the uploaded file and use the content (text) for some analysis. So, obviously, I need to use the correct charset to parse the file, right? – Ensom Hodder Aug 30 '12 at 10:32