
Given a stream of bytes (that represent characters) and the encoding of the stream, how would I obtain the code points of the characters?

InputStreamReader r = new InputStreamReader(bla, Charset.forName("UTF-8"));
int whatIsThis = r.read(); 

What is returned by read() in the above snippet? Is it the unicode codepoint?

  • FYI: Java 7 introduces the [`StandardCharsets`](http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html) constants to reduce your dependency on [stringly typed](http://www.codinghorror.com/blog/2012/07/new-programming-jargon.html) variables. – McDowell Jan 09 '13 at 20:56

2 Answers


It does not read Unicode code points, but UTF-16 code units. Code points up to U+FFFF fit in a single code unit, but code points above U+FFFF are represented by two code units each (a surrogate pair), because a 16-bit value cannot hold anything above 0xFFFF.

So, in this case:

byte[] a = {-16, -96, -128, -128}; //UTF-8 for  U+20000

ByteArrayInputStream is = new ByteArrayInputStream(a);
InputStreamReader r = new InputStreamReader(is, Charset.forName("UTF-8"));
int whatIsThis = r.read();
int whatIsThis2 = r.read();
System.out.println(whatIsThis);  //55360 (0xD840), not a valid stand-alone code point
System.out.println(whatIsThis2); //56320 (0xDC00), not a valid stand-alone code point

With the surrogate values, we put them together to get 0x20000:

((55360 - 0xD800) * 0x400) + (56320 - 0xDC00) + 0x10000 == 0x20000
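This is the same arithmetic that `Character.toCodePoint` performs for you. A minimal sketch verifying both against the stream above (same byte array, same UTF-8 sequence for U+20000):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) throws IOException {
        byte[] a = {-16, -96, -128, -128}; // UTF-8 for U+20000
        InputStreamReader r = new InputStreamReader(
                new ByteArrayInputStream(a), StandardCharsets.UTF_8);

        char high = (char) r.read(); // 0xD840, a high surrogate
        char low  = (char) r.read(); // 0xDC00, a low surrogate

        // The manual formula and the library call agree:
        int manual = ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
        int viaApi = Character.toCodePoint(high, low);
        System.out.println(Integer.toHexString(manual)); // prints "20000"
        System.out.println(Integer.toHexString(viaApi)); // prints "20000"
    }
}
```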

More about how UTF-16 works: http://en.wikipedia.org/wiki/UTF-16

Esailija
  • 138,174
  • 23
  • 272
  • 326
  • If this is true, this sounds as if the guys who wrote this did half a job... after all, they allow me to specify the encoding, so I would expect that read() will return the actual code point. Why would anybody be interested in half-baked values? It is also not clear from the javadoc. Is there something that can spare me this extra manipulation? – Vitaliy Jan 08 '13 at 20:05
  • 1
    @Vitaliy it is true, these APIs were created back when 16 bits were enough to represent any unicode code point. Nowadays you use this hackish system to deal with code points above 16 bits. – Esailija Jan 08 '13 at 20:08
  • @Vitaliy if it helps, the Unicode characters beyond 16-bit are very rare and not used in modern languages, except for rare names among the CJK ideographs afaik. Normal applications don't have to deal with them, I am just saying this for correctness. But yes, welcome to the matrix. Also see http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful – Esailija Jan 08 '13 at 20:11

Reader.read() returns a value that can be cast to char or -1 if no more data is available.

A char is (implicitly) a 16-bit code unit in the UTF-16BE encoding. This encoding can represent basic multilingual plane characters with a single char. The supplementary range is represented using two-char sequences.

The Character type contains methods for translating UTF-16 code units to Unicode code points:

A code point that requires two chars will satisfy the isHighSurrogate and isLowSurrogate tests when you pass in the two sequential values from the sequence. The codePointAt methods can be used to extract code points from code unit sequences. There are similar methods for working from code points to UTF-16 code units.
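For example, working from a String rather than a stream (a minimal sketch; the string literal embeds U+20000 as a surrogate pair):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // U+20000 written as its surrogate pair

        System.out.println(s.length());                      // 2 - two UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 - but only one code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20000

        // The surrogate tests identify each half of the pair:
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```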


A sample implementation of a code point stream reader:

import java.io.*;
public class CodePointReader implements Closeable {
  private final Reader charSource;
  private int codeUnit;

  public CodePointReader(Reader charSource) throws IOException {
    this.charSource = charSource;
    codeUnit = charSource.read();
  }

  public boolean hasNext() { return codeUnit != -1; }

  public int nextCodePoint() throws IOException {
    try {
      char high = (char) codeUnit;
      if (Character.isHighSurrogate(high)) {
        int next = charSource.read();
        if (next == -1) { throw new IOException("malformed character"); }
        char low = (char) next;
        if(!Character.isLowSurrogate(low)) {
          throw new IOException("malformed sequence");
        }
        return Character.toCodePoint(high, low);
      } else {
        return codeUnit;
      }
    } finally {
      codeUnit = charSource.read();
    }
  }

  public void close() throws IOException { charSource.close(); }
}
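A possible usage sketch for the class above, driven by a StringReader so it is self-contained (the input string mixes a BMP character with U+20000 as a surrogate pair):

```java
import java.io.IOException;
import java.io.StringReader;

public class CodePointReaderDemo {
    public static void main(String[] args) throws IOException {
        // "a" followed by U+20000 (two code units, one code point)
        try (CodePointReader reader =
                 new CodePointReader(new StringReader("a\uD840\uDC00"))) {
            while (reader.hasNext()) {
                System.out.printf("U+%04X%n", reader.nextCodePoint());
            }
        }
        // prints:
        // U+0061
        // U+20000
    }
}
```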
  • So basically you are saying that if I want to convert a sequence of chars to their corresponding code points, I need to loop with "lookahead": read 2 values, test for isHigh/isLow, if yes - combine, if no - treat the first individually and continue in the same manner from the next? – Vitaliy Jan 09 '13 at 20:24
  • @Vitaliy - Yes; I've added a simple implementation as an example. – McDowell Jan 09 '13 at 21:35