1

Suppose we have a String str = "count".

    String str = "count";
    long c1 = str.length();
    long c2 = str.codePoints().count();
    System.out.println(c1 == c2); // true

Here c1 and c2 have the same value. So my question is: when should we use length(), and when codePoints().count()?

Brajesh
  • 1,515
  • 13
  • 18
  • 1
    If all you want is the length, then use `str.length()`; `codePoints()` returns an `IntStream`, on which you then call `count()`. – Ousmane D. Apr 27 '18 at 11:47
  • 1
    @PatrickParker No. For example `String s="\uD83D\uDE83";System.out.println(s.length()+"/"+s.codePoints().count());` will return "2/1" –  Apr 27 '18 at 11:57

2 Answers

6

The difference can be demonstrated by the following code:

    StringBuilder sb = new StringBuilder();
    sb.appendCodePoint(0x12345);
    String s = sb.toString();
    System.out.println(s.length());  // Prints 2
    System.out.println(s.codePoints().count());  // Prints 1

If your string can possibly contain Unicode code points greater than 0xFFFF, then use s.codePoints().count() for a correct[*] result.

If your string only contains Unicode code points in the Basic Multilingual Plane (i.e. characters between '\u0000' and '\uFFFF', which are the ones you are most likely to use unless you want to print hieroglyphics or the like), then use s.length() instead, as it performs better (lower CPU and memory usage).
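
If you only need the number and not the stream, `String.codePointCount(int, int)` returns the same value as `codePoints().count()` without building an `IntStream`. A minimal sketch; the class and method names (`CodePointCounting`, `countCodePoints`, `hasSupplementary`) are made up for illustration:

```java
class CodePointCounting {

    // Number of Unicode code points; same value as s.codePoints().count().
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the string contains at least one code point above U+FFFF,
    // which is exactly the case where length() and the code point count differ.
    static boolean hasSupplementary(String s) {
        return countCodePoints(s) != s.length();
    }

    public static void main(String[] args) {
        String bmpOnly = "count";
        String supplementary = new StringBuilder().appendCodePoint(0x12345).toString();

        System.out.println(bmpOnly.length() + " / " + countCodePoints(bmpOnly));             // 5 / 5
        System.out.println(supplementary.length() + " / " + countCodePoints(supplementary)); // 2 / 1
        System.out.println(hasSupplementary(supplementary));                                 // true
    }
}
```

So a check like `hasSupplementary` can tell you whether the two methods would disagree for a given input.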

Footnote:
[*] By "correct", I mean a count of what a non-technical human user might consider a "character", as opposed to what length() returns: the total number of 16-bit Java char values (UTF-16 code units) used to represent the Unicode characters in this string. That is a technical measure of length that an ordinary user probably isn't concerned with.
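
Note that even the code point count is only an approximation of what a user sees as one character: a base letter followed by a combining mark is two code points but is displayed as a single accented letter. One way to get closer is `java.text.BreakIterator`. This is a rough sketch, not a complete solution; `GraphemeCounting` and `countGraphemes` are illustrative names, and results for complex scripts or emoji sequences may differ between JDK versions:

```java
import java.text.BreakIterator;

class GraphemeCounting {

    // Counts character boundaries as seen by BreakIterator, which groups
    // a base character with the combining marks that follow it.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String accented = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT

        System.out.println(accented.length());                              // 2 code units
        System.out.println(accented.codePointCount(0, accented.length()));  // 2 code points
        System.out.println(countGraphemes(accented));                       // 1 displayed character
    }
}
```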

DodgyCodeException
  • 5,963
  • 3
  • 21
  • 42
  • thanks @DodgyCodeException – Brajesh Apr 27 '18 at 12:03
  • @DodgyCodeException that is much better, but to be honest, I am not aware of any processing tool, so that a "human readable" letter could be correctly "counted" using java-code... – Eugene Apr 30 '18 at 08:46
  • @Eugene the subject is quite complicated, especially when you have ligatures like fi which can be represented by either a single ligature character or two separate characters which may appear graphically identical depending on font and kerning; or you can have combining diacritical marks, where two characters are used to represent a single accented character. And then there's \n vs \r\n etc. The whole concept of "length" is hard to define. – DodgyCodeException Apr 30 '18 at 08:49
  • @DodgyCodeException no no, ligature is a completely separate subject - and there are not a lot of them; you can easily build a map with those in your code... I was really meaning `Devanagari letters` as an example; a real alphabet – Eugene Apr 30 '18 at 08:51
  • @DodgyCodeException and to be frank - that is the real problem I have in our application, since we are restricting some inputs "by length", which for different countries mean different things (exotic ones especially). basically we are doing the same thing as twitter does, if you care – Eugene Apr 30 '18 at 08:54
3

A code unit is the number of bits an encoding uses. So UTF-8 would use 8 and UTF-16 would use 16 units. A code point is a character and this is represented by one or more code units depending on the encoding.

This means in Java String.length() returns the number of code units in a string (since it uses UTF-16) so surrogate pairs use two positions.

From quora.
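
To make the terms concrete: in UTF-16 a code unit is one 16-bit `char`, and a code point above U+FFFF takes two of them (a surrogate pair). A small sketch using the string from the comments under the question; `CodeUnitsVsCodePoints` and `unitsPerCodePoint` are names invented for this example:

```java
class CodeUnitsVsCodePoints {

    // For each code point in the string, how many UTF-16 code units (chars) it occupies.
    static int[] unitsPerCodePoint(String s) {
        return s.codePoints().map(Character::charCount).toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE83"; // 'a' followed by one supplementary code point

        System.out.println(s.length());                           // 3 code units
        System.out.println(s.codePointCount(0, s.length()));      // 2 code points
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f683

        for (int units : unitsPerCodePoint(s)) {
            System.out.println(units); // 1, then 2
        }
    }
}
```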

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
GhostCat
  • 137,827
  • 25
  • 176
  • 248
  • "This means in Java String.length() returns the number of code units" not sure what you mean by this but the grammar is wrong. String.length does not return the number of codePoints, or the "number of bits" used by the encoding. – Patrick Parker Apr 27 '18 at 13:01
  • @PatrickParker even if answers from a link are highly discouraged and this should be a comment, it's you who is wrong here. length returns the number of code units - read it again. – Eugene Apr 28 '18 at 19:01
  • 1
    @Eugene - how am I wrong? 1) "A code unit is the number of bits" 2) "String.length() returns the number of code units" 1+2) "String.length() returns the number of number of bits" ? That's why I said the "grammar is wrong" – Patrick Parker Apr 29 '18 at 00:01
  • @PatrickParker I think the OP meant, by "number of bits", not a count of bits, but a "group of bits" or "a collection of bits". Just as you could say "I met a number of people today" - this just says I met some people (precise number is not important). And, by "bits" he means "pieces", not binary digits. – DodgyCodeException Apr 30 '18 at 10:50