I have an instance of java.lang.CharSequence. I need to determine whether this instance is a sequence of Unicode scalar values (that is, whether the instance is in UTF-16 encoding form). Despite the assurances of java.lang.String, a Java string is not necessarily in UTF-16 encoding form (at least not according to the latest Unicode specification, currently 6.2), since it may contain isolated surrogate code units. (A Java string is, however, a Unicode 16-bit string.)
There are several obvious ways in which to go about this, including:
- Iterate over the code points of the sequence, explicitly validating each as a Unicode scalar value.
- Use a regular expression to search for isolated surrogate code points.
- Pipe the character sequence through a character-set encoder that reports encoding errors.
It seems as though something like this should already exist as a library function, however. I just can't find it in the standard API. Am I missing it, or do I need to implement it?
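For concreteness, here is a rough sketch of what I have in mind for the first and third approaches. The class name Utf16Check and both method names are placeholders of my own, not anything from the standard API, and I am assuming that the UTF-8 encoder reports an isolated surrogate as malformed input. I have left the regular-expression approach out of the sketch.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class Utf16Check {

    // Approach 1: walk the code points and reject any that falls in the
    // surrogate range. Character.codePointAt returns the surrogate itself
    // when the code unit at that index is not part of a valid pair.
    static boolean isScalarSequenceByScan(CharSequence s) {
        int i = 0;
        while (i < s.length()) {
            int cp = Character.codePointAt(s, i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return false; // isolated surrogate, not a scalar value
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return true;
    }

    // Approach 3: run the sequence through an encoder configured to report
    // errors. Assumption: the UTF-8 encoder treats an isolated surrogate as
    // malformed input and throws instead of substituting a replacement.
    static boolean isScalarSequenceByEncoder(CharSequence s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Both methods should accept a string such as "abc" or one containing a complete surrogate pair, and reject a string containing only "\uD800". The scan avoids allocating an output buffer, whereas the encoder version does the full encoding work just to throw the result away, which is part of why I would prefer a ready-made library call.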