2

The function that encodes a Unicode code point (an int) to a char array in Java is basically this:

return new char[] { (char) codePoint };

That is just a cast from the integer value to a char.

I would like to know how this cast is actually performed: the code behind the cast that converts an integer value to a character encoded in UTF-16. I tried looking for it in the Java source code, but with no luck.

skiforfun
  • 21
  • 1
  • 2

5 Answers

9

I'm not sure which function you're talking about.

Casting valid int code points to char will work for code points in the Basic Multilingual Plane, simply because of how UTF-16 was defined. To convert anything above U+FFFF, you should use Character.toChars(int) to convert to UTF-16 code units. The algorithm is defined in RFC 2781.
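
For reference, here is a minimal sketch of the RFC 2781 / UTF-16 encoding step (the method name is made up for illustration; in real code just call Character.toChars(int)):

    // Illustrative sketch of UTF-16 encoding per RFC 2781.
    // Prefer Character.toChars(int) in real code.
    static char[] encodeUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP code point: a single 16-bit code unit.
            return new char[] { (char) codePoint };
        }
        int u = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (u >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (u & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }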

Community
  • 1
  • 1
McDowell
  • 107,573
  • 31
  • 204
  • 267
  • Because of surrogate pairs, not all values of `char` represent valid code-points (outside of a pair) -- even if all values of `char` are valid numbers. E.g. it's not just "anything above 0xffff". +1 For inclusion of the conversion method (which answers the question) and link, however. –  May 03 '11 at 20:35
  • @pst - in case it is not apparent, "anything" in this case means a valid Unicode code point as defined in the spec (Unicode 4 for Java 6). – McDowell May 03 '11 at 20:39
  • @pst - from Unicode 4, [chapter 2](http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf): _the lowest plane, the Basic Multilingual Plane, consists of the range 0000..FFFF._ (numbers are base 16) – McDowell May 03 '11 at 20:48
  • 2
    All numbers representable by `char` are valid code points, but not all are valid scalar values. (Scalar values are code points which are not surrogate code points.) So it is true that not all `char` values are scalar values, and not all possible `char` sequences are UTF-16 strings. `Character.toChars`, however, does not check whether the argument is a valid scalar value. – Philipp May 03 '11 at 20:50
  • @pst - ah, I see what you meant in your deleted comment - you were quoting the algorithm: _If U < 0x10000, encode U as a 16-bit unsigned integer and terminate._ I will amend the answer. – McDowell May 03 '11 at 20:52
  • @McDowell It makes sense now -- the thing I was forgetting was some code-points ("surrogates area") are reserved for the purpose of encoding the surrogate pairs (and thus not valid as code-points otherwise). Thanks for the links. –  May 03 '11 at 20:56
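
To illustrate Philipp's point in the comments above: Character.toChars only range-checks its argument, so it happily converts a lone surrogate code point (this example is assumed, not from the original thread):

    // 0xD800 is a valid code point but not a scalar value;
    // Character.toChars performs no scalar-value check.
    char[] lone = Character.toChars(0xD800);  // returns { '\uD800' }, no exception
    // Character.toChars(0x110000);           // out of range: IllegalArgumentException
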
0

The code point is just a number that maps to a character; there's no real conversion going on. Unicode code points are specified in hexadecimal, so whatever your codePoint is in hex will map to that character (or glyph).

dfb
  • 13,133
  • 2
  • 31
  • 52
0

Since a char is defined to hold UTF-16 data in Java, this is all there is to it. Only if the input is an int (i.e., one that can represent a Unicode code point of U+10000 or greater) is some calculation necessary. All char values are already UTF-16.
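
For example (assuming Character.toChars; U+1D11E is just an arbitrary supplementary code point):

    char[] bmp = Character.toChars(0x0041);    // { 'A' }: one code unit, no calculation
    char[] supp = Character.toChars(0x1D11E);  // { '\uD834', '\uDD1E' }: a surrogate pair
    System.out.println(bmp.length + " " + supp.length);  // prints "1 2"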

Joachim Sauer
  • 302,674
  • 57
  • 556
  • 614
  • Not necessarily true. A `char` is just a 16-bit value (no notion of surrogate pairs by itself). Good for pointing out the range of Unicode vs. `char`, though. –  May 03 '11 at 20:34
0

All chars in Java are represented internally as UTF-16. The cast just maps the integer value to that char.

Abdullah Jibaly
  • 53,220
  • 42
  • 124
  • 197
  • 1
    Not necessarily true. A `char` is just a 16-bit value (no notion of surrogate pairs by itself). Perhaps should talk about character literals in this context? –  May 03 '11 at 20:32
0

Also, char arrays are already UTF-16 in the Java platform.

igordc
  • 1,525
  • 1
  • 14
  • 20
  • Not necessarily true. Even though a `char` is 16 bits, an array of characters can hold data which is not valid UTF-16 (invalid surrogate pairs, for instance). Not all code-points fit in a `char`. –  May 03 '11 at 20:31
  • Right, I meant that each char of an array of chars is UTF-16, because of skiforfun's array; however, I might not have correctly understood his question. – igordc May 03 '11 at 22:35
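
To make the distinction in these comments concrete, here is a small sketch (the helper is hypothetical, not part of any API) that checks whether a char array is well-formed UTF-16, i.e. contains no unpaired surrogates:

    // Hypothetical helper: true only when every high surrogate is
    // immediately followed by a low surrogate, and no low surrogate
    // appears on its own.
    static boolean isWellFormedUtf16(char[] chars) {
        for (int i = 0; i < chars.length; i++) {
            if (Character.isHighSurrogate(chars[i])) {
                if (i + 1 >= chars.length || !Character.isLowSurrogate(chars[i + 1])) {
                    return false;  // unpaired high surrogate
                }
                i++;  // skip the paired low surrogate
            } else if (Character.isLowSurrogate(chars[i])) {
                return false;      // stray low surrogate
            }
        }
        return true;
    }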