When an I/O stream manages 8-bit bytes of raw binary data, it is called a byte stream. And, when the I/O stream manages 16-bit Unicode characters, it is called a character stream.

The byte stream part is clear: it works with 8-bit bytes. So if I were to write a character that needs 3 bytes, it would write only its last 8 bits, producing incorrect output.
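
A minimal sketch of that truncation, assuming the character is handed to `OutputStream.write(int)` as its 16-bit value (the class and method names here are hypothetical, chosen only for illustration):

```java
import java.io.ByteArrayOutputStream;

public class ByteTruncationDemo {
    // Writes one int to a byte stream and returns what was actually stored.
    static byte[] writeAsByte(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(value); // OutputStream.write(int) keeps only the low 8 bits
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // '\u1EA0' is LATIN CAPITAL LETTER A WITH DOT BELOW, value 0x1EA0
        byte[] stored = writeAsByte('\u1EA0');
        System.out.printf("%d byte(s): 0x%02X%n", stored.length, stored[0] & 0xFF);
        // prints "1 byte(s): 0xA0"
    }
}
```

Only the low byte `0xA0` survives, which is neither the UTF-8 nor the UTF-16 form of the character.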

So that is why we use character streams. Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store.

Are you seeing the pattern? We can't know how many bytes it will take to write any of these characters until we convert them. So my question is: why is it said that character streams manage 16-bit Unicode characters? When I wrote Ạ, which takes 3 bytes, it wasn't cut down to its last 16 bits the way byte streams cut to the last 8 bits. What does that quote even mean, then?

Stefan
  • Great question. I always struggled with that too! Hope someone can give a good answer. – Ana Maria Sep 07 '20 at 01:30
  • 16-bit Unicode characters are not stored in UTF-8. They're stored in UTF-16. The character stream is responsible for doing the conversion. – Louis Wasserman Sep 07 '20 at 01:32
  • @LouisWasserman Hmmm, you make it even more complicated, haha. What storing are you talking about? Do you mean the actual data written to a file? Well, as I demonstrated, Latin Capital Letter Ạ takes 24 bits. And yet it was stored correctly, even though you said only 16 bits would be stored. – Stefan Sep 07 '20 at 01:38
  • No. In Java, a `char` is 16 bits, representing a UTF-16 character. In UTF-16, Ạ takes only 16 bits. You correctly demonstrated that it takes 24 bits _in UTF-8_. – Louis Wasserman Sep 07 '20 at 01:48
  • As per Louis' comment, you are confusing UTF systems. UTF-8 is NOT native to Java; UTF-16 is the native Java character encoding. If you want UTF-8, you'll need to use an `InputStreamReader` or similar, which reads the bytes and produces UTF-16 characters. – John Sep 07 '20 at 01:53
  • But I didn't do any 'casting to UTF-8'. I used a simple `FileWriter.write()`. So why didn't it use UTF-16 if it is native, as you say? – Stefan Sep 07 '20 at 01:57
  • UTF-16 is native to Java, not necessarily your computer system, and when writing to files without a specified charset Java will use your computer's default encoding. – Louis Wasserman Sep 07 '20 at 02:02
  • @LouisWasserman Ohh, now I get it. So, in a manner of speaking, the computer's default encoding overrides UTF-16, which is the default for Java? – Stefan Sep 07 '20 at 02:04
  • That's correct. – Louis Wasserman Sep 07 '20 at 02:09
  • When you quote something, you should name the source you’re citing. – Holger Sep 07 '20 at 07:54

2 Answers

In Java, a String is composed of a sequence of 16-bit chars, representing text stored in the UTF-16 encoding.

A Charset is an object that describes how to convert Unicode characters to a sequence of bytes. UTF-8 is an example of a charset.

A character stream like Writer, when it outputs to a thing that contains bytes -- a file, or a byte output stream like OutputStream -- uses a Charset to convert Strings to simple byte sequences for output. (Technically, it converts the UTF-16 chars to Unicode characters and then converts those to byte sequences with the Charset.) A Reader, when reading from a byte source, does the reverse conversion.

In UTF-16, Ạ is represented as the 16-bit char 0x1EA0. It takes only 16 bits in UTF-16, not 24 bits as in UTF-8.

If you converted it to bytes with the UTF-8 encoding, as here:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, StandardCharsets.UTF_8);
writer.write("Ạ");
writer.close();
return baos.toByteArray();

Then you would get the 3-byte sequence 0xE1 0xBA 0xA0, as expected.
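
For anyone who wants to run this, here is a self-contained sketch of the same idea. The class name `EncodeDemo` and the helpers `encode` and `hex` are made up for illustration, and the character is written as the escape `\u1EA0` so the result does not depend on the source file's encoding:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeDemo {
    // Pushes the text through a Writer that encodes with the given charset
    // and returns the bytes that reached the underlying byte stream.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(baos, charset)) {
            writer.write(text);
        }
        return baos.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex, e.g. "E1 BA A0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u1EA0"; // one 16-bit char in memory
        System.out.println(hex(encode(text, StandardCharsets.UTF_8)));    // E1 BA A0
        System.out.println(hex(encode(text, StandardCharsets.UTF_16BE))); // 1E A0
        System.out.println(hex(encode("A", StandardCharsets.UTF_8)));     // 41
    }
}
```

The `Writer` receives the same single `char` in the first two calls; only the charset passed to `OutputStreamWriter` decides how many bytes come out.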

Louis Wasserman
  • It’s better to say that the *API* of the `String` class is designed around processing the characters in terms of 16 bit `char` units (and historically assuming such a storage). This makes implementations using different storage internally harder but not impossible. The reference implementation does already use a different storage for strings composed of only latin-1 characters (since JDK 9). Further, not every `Writer` does the named conversion, only those writing bytes to an `OutputStream` or a `ByteChannel`. – Holger Sep 07 '20 at 08:00
  • The internal implementation details of the `String` API don't seem particularly important here, only its public API. – Louis Wasserman Sep 07 '20 at 18:21
  • Exactly. That’s why it is better to say “its public API”. – Holger Sep 08 '20 at 06:17

In Java, a character (char) is always 16 bits, as can be seen from its maximum value, 65535. This is why the quote is not wrong: a char is indeed 16 bits.

"How can all the Unicode characters be stored in just 16 bits?" you might ask. This is done in Java using the UTF-16 encoding. Here's how it works (in very simplified terms):

Every Unicode code point in the Basic Multilingual Plane (BMP) is encoded in 16 bits. (Yes, 16 bits is enough for that.) Every code point outside of the BMP is encoded with a pair of 16-bit chars, called a surrogate pair.
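
A small sketch of that rule using the standard `Character` methods. The class name `SurrogateDemo`, the helper `charUnits`, and the choice of U+1F600 as the example outside the BMP are my own, picked just for illustration:

```java
public class SurrogateDemo {
    // Number of 16-bit char units Java needs for one Unicode code point.
    static int charUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int inBmp = 0x1EA0;       // A with dot below, inside the BMP
        int outsideBmp = 0x1F600; // an emoji code point, outside the BMP

        System.out.println((int) Character.MAX_VALUE); // 65535
        System.out.println(charUnits(inBmp));          // 1
        System.out.println(charUnits(outsideBmp));     // 2

        // The two chars of the pair are a high surrogate followed by a low one.
        char[] pair = Character.toChars(outsideBmp);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
    }
}
```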

"Ạ" (U+1EA0) is inside the BMP, so can be encoded with 16 bits.

You said:

Say I want to write Latin Capital Letter Ạ. I would need 3 bytes for storing in UTF-8. But say I also want to store 'normal' A. Now it would take 1 byte to store!

That does not make the quote incorrect. The stream still "manages 16-bit characters", because that's what you give it in Java code. When you call println on a PrintStream, you are giving it a String, which is a sequence of chars under the hood, each of which is 16 bits. So it really is managing a stream of 16-bit characters. It's just that it outputs them in a different encoding.

It's probably worth mentioning what happens when you try to print a character that is not in the BMP. This would still not make the quote incorrect. The quote does not say "code point". It says "character", which would refer to the high and low surrogates of the surrogate pair that you are printing.

Sweeper