2

The function that encodes a Unicode code point (an int) to a char array in Java is basically this:

return new char[] { (char) codePoint };

That is just a cast from the integer value to a char.

I would like to know how this cast is actually performed: the code behind the cast that converts an integer value to a character encoded in UTF-16. I tried looking for it in the Java source code, but with no luck.

skiforfun
  • 21
  • 1
  • 2

5 Answers

9

I'm not sure which function you're talking about.

Casting valid int code points to char will work for code points in the Basic Multilingual Plane, simply because of how UTF-16 was defined. To convert anything above U+FFFF, you should use Character.toChars(int) to convert to UTF-16 code units. The algorithm is defined in RFC 2781.
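
For reference, here is a minimal sketch of the RFC 2781 / UTF-16 encoding step (the method name is made up for illustration; in real code just call Character.toChars(int)):

    // Illustrative sketch of UTF-16 encoding per RFC 2781.
    // Prefer Character.toChars(int) in real code.
    static char[] encodeUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP code point: a single 16-bit code unit.
            return new char[] { (char) codePoint };
        }
        int u = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (u >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (u & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }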

Community
  • 1
  • 1
McDowell
  • 107,573
  • 31
  • 204
  • 267
  • Because of surrogate pairs, not all values of `char` represent valid code-points (outside of a pair) -- even if all values of `char` are valid numbers. E.g. it's not just "anything above 0xffff". +1 For inclusion of the conversion method (which answers the question) and link, however. –  May 03 '11 at 20:35
  • @pst - in case it is not apparent, "anything" in this case means a valid Unicode code point as defined in the spec (Unicode 4 for Java 6). – McDowell May 03 '11 at 20:39
  • @pst - from Unicode 4, [chapter 2](http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf): _the lowest plane, the Basic Multilingual Plane, consists of the range 0000..FFFF._ (numbers are base 16) – McDowell May 03 '11 at 20:48
  • 2
    All numbers representable by `char` are valid code points, but not all are valid scalar values. (Scalar values are code points which are not surrogate code points.) So it is true that not all `char` values are scalar values, and not all possible `char` sequences are UTF-16 strings. `Character.toChars`, however, does not check whether the argument is a valid scalar value. – Philipp May 03 '11 at 20:50
  • @pst - ah, I see what you meant in your deleted comment - you were quoting the algorithm: _If U < 0x10000, encode U as a 16-bit unsigned integer and terminate._ I will amend the answer. – McDowell May 03 '11 at 20:52
  • @McDowell It makes sense now -- the thing I was forgetting was some code-points ("surrogates area") are reserved for the purpose of encoding the surrogate pairs (and thus not valid as code-points otherwise). Thanks for the links. –  May 03 '11 at 20:56
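
To illustrate Philipp's point in the comments above: Character.toChars only range-checks its argument, so it happily converts a lone surrogate code point (this example is assumed, not from the original thread):

    // 0xD800 is a valid code point but not a scalar value;
    // Character.toChars performs no scalar-value check.
    char[] lone = Character.toChars(0xD800);  // returns { '\uD800' }, no exception
    // Character.toChars(0x110000);           // out of range: IllegalArgumentException
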
0

The code point is just a number that maps to a character; there's no real conversion going on. Unicode code points are specified in hexadecimal, so whatever your codePoint is in hex will map to that character (or glyph).

dfb
  • 13,133
  • 2
  • 31
  • 52
0

Since a char is defined to hold UTF-16 data in Java, this is all there is to it. Only if the input is an int (i.e., one that can represent a Unicode code point of U+10000 or greater) is some calculation necessary. All char values are already UTF-16.
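
For example (assuming Character.toChars; U+1D11E is just an arbitrary supplementary code point):

    char[] bmp = Character.toChars(0x0041);    // { 'A' }: one code unit, no calculation
    char[] supp = Character.toChars(0x1D11E);  // { '\uD834', '\uDD1E' }: a surrogate pair
    System.out.println(bmp.length + " " + supp.length);  // prints "1 2"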

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
  • Not necessarily true. A `char` is just a 16-bit value (no notion of surrogate pairs by itself). Good for pointing out the range of Unicode vs. `char`, though. –  May 03 '11 at 20:34
0

All chars in Java are represented internally as UTF-16. The cast just maps the integer value to that char.

Abdullah Jibaly
  • 53,220
  • 42
  • 124
  • 197
  • 1
    Not necessarily true. A `char` is just a 16-bit value (no notion of surrogate pairs by itself). Perhaps should talk about character literals in this context? –  May 03 '11 at 20:32
0

Also, char arrays are already UTF-16 in the Java platform.

igordc
  • 1,525
  • 1
  • 14
  • 20
  • Not necessarily true. Even though a `char` is 16 bits, an array of characters can hold data which is not valid UTF-16 (invalid surrogate pairs, for instance). Not all code-points fit in a `char`. –  May 03 '11 at 20:31
  • Right, I meant that each char of an array of chars is UTF-16, because of skiforfun's array; however, I might not have correctly understood his question. – igordc May 03 '11 at 22:35
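
To make the distinction in these comments concrete, here is a small sketch (the helper is hypothetical, not part of any API) that checks whether a char array is well-formed UTF-16, i.e. contains no unpaired surrogates:

    // Hypothetical helper: true only when every high surrogate is
    // immediately followed by a low surrogate, and no low surrogate
    // appears on its own.
    static boolean isWellFormedUtf16(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            if (Character.isHighSurrogate(chars[i])) {
                if (i + 1 >= chars.length || !Character.isLowSurrogate(chars[i + 1])) {
                    return false;  // unpaired high surrogate
                }
                i++;  // skip the paired low surrogate
            } else if (Character.isLowSurrogate(chars[i])) {
                return false;      // stray low surrogate
            }
        }
        return true;
    }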