How to verify whether an instance of CharSequence is a sequence of Unicode scalar values?

Question

I have an instance of java.lang.CharSequence. I need to determine whether this instance is a sequence of Unicode scalar values (that is, whether the instance is in UTF-16 encoding form). Despite the assurances of java.lang.String, a Java string is not necessarily in UTF-16 encoding form (at least not according to the latest Unicode specification, currently 6.2), since it may contain isolated surrogate code units. (A Java string is, however, a Unicode 16-bit string.)

There are several obvious ways in which to go about this, including:

- Iterate over the code points of the sequence, explicitly validating each as a Unicode scalar value.
- Use a regular expression to search for isolated surrogate code points.
- Pipe the character sequence through a character-set encoder that reports encoding errors.
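For illustration, the encoder-based option might be sketched as follows. This is a minimal sketch, not a definitive implementation: the class name `EncoderCheck` and the method name `isWellFormed` are hypothetical, and UTF-8 is an arbitrary choice of target charset whose encoder rejects unpaired surrogates.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderCheck {
    // Hypothetical helper: returns true if s can be encoded without errors,
    // i.e. it contains no isolated surrogate code units.
    static boolean isWellFormed(CharSequence s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // The encoded bytes are discarded; only success or failure matters.
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

This allocates a byte buffer for the whole sequence, so it is likely heavier than a direct scan of the code units.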

It seems as though something like this should already exist as a library function, however. I just can't find it in the standard API. Am I missing it, or do I need to implement it?

There is an [isValidCodePoint](http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isValidCodePoint%28int%29) function. All it needs is an extra filter to exclude surrogates. — R. Martinho Fernandes, Apr 04 '13 at 10:45
@R.MartinhoFernandes The isValidCodePoint function determines whether an int value falls within the range of Unicode code points. However, the range of Unicode scalar values is a restriction on the range of Unicode code points. — Nathan Ryan, Apr 04 '13 at 10:52
Well, my point is that isValidCodePoint is the best you have. I believe you will have to get this validation from an external library (like ICU) or do it yourself. — R. Martinho Fernandes, Apr 04 '13 at 11:03
@R.MartinhoFernandes Hmm, the ICU4J libraries are a good idea, if a bit heavy. I might need them for some other stuff, though, so they're worth a look. Thanks. — Nathan Ryan, Apr 04 '13 at 11:24
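The "extra filter" discussed in these comments could look something like the sketch below. The class name `ScalarValues` and the method name `isScalarValue` are hypothetical; only `Character.isValidCodePoint`, `Character.MIN_SURROGATE`, and `Character.MAX_SURROGATE` come from the standard API.

```java
final class ScalarValues {
    // Hypothetical helper: a Unicode scalar value is any valid code point
    // outside the surrogate range U+D800..U+DFFF.
    static boolean isScalarValue(int codePoint) {
        return Character.isValidCodePoint(codePoint)
                && (codePoint < Character.MIN_SURROGATE
                        || codePoint > Character.MAX_SURROGATE);
    }
}
```

This only classifies a single `int`. Checking a whole sequence still requires walking it code point by code point.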

Evgeniy Dorofeev · Accepted Answer · 2013-04-04T11:31:44.300


Try this function:

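// Returns false if s contains an isolated (unpaired) surrogate code unit.
static boolean isValidUTF16(String s) {
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // A low surrogate must be preceded by a high surrogate.
        boolean isolatedLow = Character.isLowSurrogate(c)
                && (i == 0 || !Character.isHighSurrogate(s.charAt(i - 1)));
        // A high surrogate must be followed by a low surrogate.
        boolean isolatedHigh = Character.isHighSurrogate(c)
                && (i == s.length() - 1 || !Character.isLowSurrogate(s.charAt(i + 1)));
        if (isolatedLow || isolatedHigh) {
            return false;
        }
    }
    return true;
}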

Here's a test:

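public static void main(String[] args) {
    // Low surrogate first, followed by an unpaired high surrogate: prints false
    System.out.println(isValidUTF16("\uDC00\uDBFF"));
    // Well-formed surrogate pair (high, then low): prints true
    System.out.println(isValidUTF16("\uDBFF\uDC00"));
}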

edited Apr 04 '13 at 11:31

answered Apr 04 '13 at 11:05


That'll work. It can be done a bit more efficiently by iterating over code points rather than characters, but that's effectively one of the possible implementations I listed in the question. I think you have an extra condition at the end, though, since `i` can never equal `s.length()`. – Nathan Ryan Apr 04 '13 at 11:28
Thanks for catching the extra condition; fixed. It was left over from refactoring, and I didn't notice it because of the long lines in the code. – Evgeniy Dorofeev Apr 04 '13 at 11:33
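The code-point iteration mentioned in the comments might be sketched as follows, taking a `CharSequence` as in the question. The class name `CodePointCheck` and the method name `isScalarValueSequence` are hypothetical. The sketch relies on `Character.codePointAt(CharSequence, int)` returning the surrogate code unit itself when it is not part of a valid pair.

```java
final class CodePointCheck {
    // Hypothetical helper: returns true if every code point in s is a
    // Unicode scalar value, i.e. s contains no isolated surrogates.
    static boolean isScalarValueSequence(CharSequence s) {
        int i = 0;
        while (i < s.length()) {
            int cp = Character.codePointAt(s, i);
            // A valid pair decodes to a supplementary code point (>= U+10000);
            // an unpaired surrogate is returned as-is and falls in this range.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return false;
            }
            // Advance by one or two code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return true;
    }
}
```

Compared with the per-character loop, this looks at each surrogate pair once instead of twice.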
