1

Given a String `string` in Java, does `string.codePoints().toArray().length` reflect the length of the String in terms of the actual characters that a human would find meaningful? In other words, does it smooth over escape characters and other artifacts of encoding?

Edit: By "human" I kind of meant "programmer", as I would imagine most programmers would see `\r\n` as two characters, ESC as one character, etc. But now I see that even accent marks get atomized, so it doesn't matter.

tacos_tacos_tacos
  • It'd be a lot easier to answer your question if you could give a few example Strings and a few results you're looking for. Humans are weird :D – visch Aug 24 '16 at 12:33
  • A quick test says no: `String s = "s\n";` `length()` = 2 and `codePoints().toArray().length` = 2. – Zircon Aug 24 '16 at 12:34
  • That's not exactly what I meant by my question but you are right. I meant more like... "a programmer would see" – tacos_tacos_tacos Aug 24 '16 at 12:41
  • It’s not very useful to take a term and add an ambiguous interpretation to it, like taking “*character*” and adding “*that a human would find meaningful*” and later on adding “*[like] a programmer would see*”. From the standard’s point of view `U+006E U+0303` is “*canonically equivalent*” to `U+00F1` (`ñ`), in other words, it’s the same *character*. A programmer will notice that these are two different `int[]` arrays and there’s already a name for that: these are two different sequences of *code points*. – Holger Aug 24 '16 at 14:37

3 Answers

10

No.

For example:
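The original examples did not survive in this copy of the answer, so the following is only a minimal sketch of the kind of cases that show the mismatch. The class and method names (`CodePointExamples`, `codePointCount`) are hypothetical and chosen purely for illustration.

```java
import java.text.Normalizer;

class CodePointExamples {
    // Same quantity the question asks about: s.codePoints().toArray().length
    static int codePointCount(String s) {
        return s.codePoints().toArray().length;
    }

    public static void main(String[] args) {
        String crlf = "\r\n";            // two control characters
        String esc = "\u001B";           // ESC, a single control character
        String decomposed = "n\u0303";   // 'n' + COMBINING TILDE, usually rendered as one glyph
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC); // U+00F1
        String emoji = "\uD83D\uDE00";   // U+1F600 stored as a surrogate pair

        System.out.println(codePointCount(crlf));       // 2
        System.out.println(codePointCount(esc));        // 1
        System.out.println(codePointCount(decomposed)); // 2
        System.out.println(codePointCount(composed));   // 1
        System.out.println(emoji.length());             // 2 (UTF-16 code units)
        System.out.println(codePointCount(emoji));      // 1
    }
}
```

Control characters are counted like any other code point, and the same visible letter can count as one or two code points depending on normalization.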


Now it is debatable whether some of these might be "actual characters that a human would find meaningful" ... but the overall answer is still No.


You clarified as follows:

By "human" I kind of meant "programmer" as I would imagine most programmers would see \r\n as two characters ...

It is more complicated than that. I am a programmer, and for me it depends on the context whether \r\n are meaningful or not. If I am reading a README file, my brain will treat differences in white space as having no semantic importance. But if I am writing a parser, my code would take whitespace into account ... depending on the language it is intended to parse.

Stephen C
1

Just check the Javadoc of `CharSequence` for the `codePoints()` method:

Returns a stream of code point values from this sequence. Any surrogate pairs encountered in the sequence are combined as if by Character.toCodePoint and the result is passed to the stream. Any other code units, including ordinary BMP characters, unpaired surrogates, and undefined code units, are zero-extended to int values which are then passed to the stream.

https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--
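To make the Javadoc wording concrete, here is a small sketch of how paired and unpaired surrogates come out of `codePoints()`. The class name `SurrogateHandling` and the helper `codePointsOf` are made up for this illustration.

```java
import java.util.Arrays;

class SurrogateHandling {
    // Collects the values that CharSequence.codePoints() streams for s.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
        // The pair is combined into a single int.
        int[] paired = codePointsOf("a\uD83D\uDE00");
        System.out.println(Arrays.toString(paired));   // [97, 128512]

        // A lone high surrogate is not dropped or replaced; it is
        // zero-extended to an int and passed through as-is.
        int[] unpaired = codePointsOf("\uD83Dx");
        System.out.println(Arrays.toString(unpaired)); // [55357, 120]
    }
}
```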

The code-point-related members of the `String` class also help to understand what a code point is:

`String(int[] codePoints, int offset, int count)`: Allocates a new String that contains characters from a subarray of the Unicode code point array argument.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

A code point is an `int` representing a Unicode code point (https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode), so every character is counted, including those that are not human-readable.
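As a rough illustration of the constructor quoted above, the sketch below builds a `String` from raw code points and compares `length()` with the code point count. `CodePointConstructor` and `fromCodePoints` are hypothetical names used only here.

```java
class CodePointConstructor {
    // Builds a String from code points via the String(int[], int, int) constructor.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // 'A', U+1F600 (outside the BMP), and ESC (a control character)
        String s = fromCodePoints(0x41, 0x1F600, 0x1B);

        System.out.println(s.length());                      // 4 UTF-16 code units
        System.out.println(s.codePoints().toArray().length); // 3 code points
    }
}
```

The non-printable ESC still contributes one code point, and only the surrogate pair is collapsed.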

loicmathieu
0

Since Java 8, calling `codePoints()` on a `String` returns an `IntStream` of its code point values. Calling `toArray()` on that stream yields an `int[]` with one element per code point, so its length is the number of code points in the string, with each code point counted separately.

Sakalya