If I use Java 8's String.codePoints to get an array of int codePoints, is it true that the length of the array is the count of characters?

Question

Given a `String string` in Java, does `string.codePoints().toArray().length` reflect the length of the String in terms of the actual characters that a human would find meaningful? In other words, does it smooth over escape characters and other artifacts of encoding?

Edit: By "human" I kind of meant "programmer" as I would imagine most programmers would see \r\n as two characters, ESC as one character, etc. But now I see that even the accent marks get atomized into separate code points, so it doesn't matter.

It'd be a lot easier to answer your question if you could give a few example Strings and a few results you're looking for. Humans are weird :D — visch, Aug 24 '16 at 12:33
A quick test says no: `String s = "s\n";` `length()` = 2 and `codePoints().toArray().length` = 2. — Zircon, Aug 24 '16 at 12:34
That's not exactly what I meant by my question but you are right. I meant more like... "a programmer would see" — tacos_tacos_tacos, Aug 24 '16 at 12:41
It’s not very useful to take a term and add an ambiguous interpretation to it, like taking “*character*” and adding “*that a human would find meaningful*” and later-on adding “*[like] a programmer would see*”. From the standard’s point of view `U+006E U+0303` is “*canonically equivalent*” to `U+00F1` (`ñ`), in other words, it’s the same *character*. A programmer will notice that these are two different `int[]` arrays and there’s already a name for that, these are two different sequences of *code points*. — Holger, Aug 24 '16 at 14:37
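A minimal sketch of the point made in the last comment, assuming the standard `java.text.Normalizer` class is an acceptable way to compare the two forms. The class name `CanonicalEquivalenceDemo` and the helper `codePointCount` are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

public class CanonicalEquivalenceDemo {
    // Hypothetical helper: counts code points the same way the question does.
    static int codePointCount(String s) {
        return s.codePoints().toArray().length;
    }

    public static void main(String[] args) {
        String decomposed = "n\u0303"; // U+006E followed by U+0303 (combining tilde)
        String composed = "\u00F1";    // U+00F1, the precomposed form

        System.out.println(codePointCount(decomposed)); // 2
        System.out.println(codePointCount(composed));   // 1

        // The two strings differ as code point sequences ...
        System.out.println(decomposed.equals(composed)); // false

        // ... but NFC normalization maps the decomposed form to the composed one.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));        // true
    }
}
```

So the same character can yield an array of length 1 or 2 depending on how it was encoded, unless the string is normalized first.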

Stephen C · Accepted Answer · 2016-08-24T12:53:59.887

No.

For example:

- Control characters (such as ESC, CR, NL, etc.) are not removed. Each has its own code point in Unicode.
- Sequences of spaces, tabs, etc. are not collapsed into one.
- Discretionary (soft) hyphen characters, U+00AD (http://www.fileformat.info/info/unicode/char/00AD/index.htm), are not removed.
- Unicode combining characters (https://en.wikipedia.org/wiki/Combining_character) are not merged with their base characters.

Now it is debatable whether some of these might be "actual characters that a human would find meaningful" ... but the overall answer is still No.
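A short sketch that checks each of the cases above. The class name `CodePointCountDemo` and the helper `codePointCount` are made up for this illustration; only `String.codePoints()` and `IntStream.toArray()` come from the JDK.

```java
public class CodePointCountDemo {
    // Hypothetical helper: the expression from the question, wrapped in a method.
    static int codePointCount(String s) {
        return s.codePoints().toArray().length;
    }

    public static void main(String[] args) {
        // Control characters each keep their own code point: a, CR, LF, b.
        System.out.println(codePointCount("a\r\nb"));    // 4

        // ESC followed by "[0m" is four code points, not zero or one.
        System.out.println(codePointCount("\u001B[0m")); // 4

        // Runs of whitespace are not collapsed: a, space, space, tab, space, b.
        System.out.println(codePointCount("a  \t b"));   // 6

        // The soft hyphen U+00AD is counted even though it is usually invisible.
        System.out.println(codePointCount("co\u00ADop")); // 5

        // A base letter plus a combining acute accent is two code points.
        System.out.println(codePointCount("e\u0301"));   // 2
    }
}
```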

You clarified as follows:

By "human" I kind of meant "programmer" as I would imagine most programmers would see \r\n as two characters ...

It is more complicated than that. I am a programmer, and for me it depends on the context whether \r\n are meaningful or not. If I am reading a README file, my brain will treat differences in white space as having no semantic importance. But if I am writing a parser, my code would take whitespace into account ... depending on the language it is intended to parse.

score 1 · Answer 2 · answered Aug 24 '16 at 12:46

Just check the Javadoc of `CharSequence` for the `codePoints()` method:

Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream. (https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--)
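To illustrate the Javadoc passage, here is a small sketch of what "combined" and "zero-extended" mean in practice. The class name `SurrogatePairDemo` and the helper `codePointsOf` are hypothetical; U+1F600 is just one example of a code point outside the BMP.

```java
public class SurrogatePairDemo {
    // Hypothetical helper returning the code points of a string as an array.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so UTF-16 stores it as two char values.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());               // 2 (UTF-16 code units)
        System.out.println(codePointsOf(supplementary).length);   // 1 (the pair is combined)
        System.out.println(codePointsOf(supplementary)[0] == 0x1F600); // true

        // An unpaired high surrogate is passed through as its own int value.
        String lone = String.valueOf((char) 0xD83D);
        int[] cps = codePointsOf(lone);
        System.out.println(cps.length);                           // 1
        System.out.println(cps[0] == 0xD83D);                     // true
    }
}
```

This is the one place where the array length differs from `length()`: surrogate pairs shrink to a single element, while everything else maps one to one.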

Then look at the `String` constructor that takes code points, to see how a code point is used:

`String(int[] codePoints, int offset, int count)`: Allocates a new String that contains characters from a subarray of the Unicode code point array argument. (https://docs.oracle.com/javase/8/docs/api/java/lang/String.html)

In Java, a code point is an `int` holding a Unicode code point value (https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode), so every character is included, even those that are not human-readable.
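As a rough illustration of that constructor, the sketch below round-trips a string through its code point array. The class name `CodePointRoundTripDemo` and the helper `fromCodePoints` are invented for this example; the sample string with a tab is an arbitrary choice to show that a non-printable character survives.

```java
import java.util.Arrays;

public class CodePointRoundTripDemo {
    // Hypothetical wrapper around the String(int[], int, int) constructor.
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        String original = "a\tb";
        int[] cps = original.codePoints().toArray();

        // The tab is present as an ordinary element (value 9) of the array.
        System.out.println(Arrays.toString(cps)); // [97, 9, 98]

        // Rebuilding from the whole array gives back the original string.
        String whole = fromCodePoints(cps, 0, cps.length);
        System.out.println(whole.equals(original)); // true

        // offset and count select a subarray: here the last two code points.
        String tail = fromCodePoints(cps, 1, 2);
        System.out.println(tail.equals("\tb"));     // true
    }
}
```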

score 0 · Answer 3 · answered Aug 24 '16 at 12:43

In Java 8, calling `codePoints()` on a String object returns a stream of code points (an `IntStream`). Calling `toArray()` on that stream treats each code point separately, so the length of the resulting array is the number of code points in the string.

answered Aug 24 '16 at 12:43

Sakalya
