4

Since a Java char is 16 bits long, how can it represent the full range of Unicode code points? It can only represent 65,536 distinct values, is that right?

Cœur
user705414

2 Answers

9

Yes, a Java char is a UTF-16 code unit. If you need to represent Unicode characters outside the Basic Multilingual Plane, you need to use surrogate pairs within a java.lang.String. The String class provides various methods to work with full Unicode code points, such as codePointAt(index).
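To make the difference between a char and a code point concrete, here is a minimal sketch. The class name CodePointAtDemo and the sample string are made up for illustration; the String methods are the standard ones. The sample is "a" followed by U+1F600, written as an explicit surrogate pair.

```java
public class CodePointAtDemo {
    // "a" followed by U+1F600, a supplementary character stored as two chars.
    static final String TEXT = "a\uD83D\uDE00";

    public static void main(String[] args) {
        // length() counts 16-bit code units, not characters: prints 3
        System.out.println(TEXT.length());
        // codePointCount() counts full Unicode code points: prints 2
        System.out.println(TEXT.codePointCount(0, TEXT.length()));
        // charAt(1) returns only the high surrogate: prints d83d
        System.out.println(Integer.toHexString(TEXT.charAt(1)));
        // codePointAt(1) combines the surrogate pair: prints 1f600
        System.out.println(Integer.toHexString(TEXT.codePointAt(1)));
    }
}
```

Note that codePointAt takes an index in char units, so index 1 here is where the surrogate pair starts.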

From section 3.1 of the Java Language Specification:

The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding. A few APIs, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java platform provides methods to convert between the two representations.
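The surrogate-pair encoding described in the quote can be worked through by hand. The sketch below is illustrative only: SurrogateMath, encode, and decode are hypothetical names, and the arithmetic follows the UTF-16 definition rather than anything spelled out in the quoted text. It checks the manual result against the platform's own Character.toChars and Character.toCodePoint.

```java
import java.util.Arrays;

public class SurrogateMath {
    // Split a supplementary code point (U+10000 to U+10FFFF) into a surrogate pair.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombine a high and low surrogate into the code point they encode.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] manual = encode(cp);
        char[] library = Character.toChars(cp);
        // Both comparisons print true: the manual math agrees with the Character API.
        System.out.println(Arrays.equals(manual, library));
        System.out.println(decode(manual[0], manual[1])
                == Character.toCodePoint(library[0], library[1]));
    }
}
```

For U+1F600 the offset is 0xF600, which gives the high surrogate 0xD83D and the low surrogate 0xDE00. In real code you would normally call the Character methods directly rather than reimplementing the arithmetic.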

See the Character docs for more information.

Community
Jon Skeet
2

A single char is an unsigned 16-bit value, so it can represent any code point up to 0xFFFF but not supplementary characters, whose code points are larger. Java is best thought of as storing UTF-16 code units in char, so a supplementary character is represented as a pair of chars, a surrogate pair. One char can't hold such a character, but Java still handles it through the code point methods on String and Character.
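As a rough illustration of how Java "handles it", here is one way to walk a string by code point instead of by char. CodePointWalk and codePointsOf are hypothetical names for this sketch; codePointAt and Character.charCount are standard methods.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointWalk {
    // Collect every code point in s, stepping over surrogate pairs as single units.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            // charCount is 1 for BMP code points and 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // "a", U+1F600, "b": four chars, three code points.
        for (int cp : codePointsOf("a\uD83D\uDE00b")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

A plain loop over charAt(i) would instead see the two surrogates as separate values, which is the usual source of bugs with characters outside the BMP.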

Sean Owen