In Java, what's the difference between Character.isBmpCodePoint and Character.isValidCodePoint?

I know the upper limits are 0xFFFF and 0x10FFFF respectively, but what does that imply? Which one should I use?

FredSuvn
  • BMP: Basic Multilingual Plane, so the code points from 0 to 65535 (note: not all code points are valid) – Giacomo Catenazzi Sep 13 '21 at 14:20
  • @FredSuvn Thanks for the clarification you added to your question. Now my answer is obviously too short, and I wish to expand it. But this takes a bit of time; there is much to learn about it, as can already be seen in the comments. Also, Wikipedia has a lot of information about the topic. I hope that I'll be able to give an overview of it. – Wolf Sep 14 '21 at 09:47

1 Answer

The Basic Multilingual Plane (BMP) is a subset of legal code points in Unicode (see Wikipedia).

But let's have a look into the official documentation.

isValidCodePoint

true if the specified code point value is between MIN_CODE_POINT and MAX_CODE_POINT inclusive; false otherwise.

  • MIN_CODE_POINT: U+0000
  • MAX_CODE_POINT: U+10FFFF

isBmpCodePoint

true if the specified code point is between MIN_VALUE and MAX_VALUE inclusive; false otherwise.

  • MIN_VALUE: '\u0000'
  • MAX_VALUE: '\uFFFF'

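The documentation mixes types here (the code point limits are `int` constants, while MIN_VALUE and MAX_VALUE are `char` constants), but it's easy to see that the upper inclusive limits differ: 0xFFFF is below 0x10FFFF. So every BMP code point is also a valid code point, but not the other way around.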
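To make the difference concrete, here is a minimal sketch. The class name `CodePointRanges` and the helper `classify` are made up for illustration; only the `Character` methods come from the JDK.

```java
public class CodePointRanges {
    // Hypothetical helper: sorts an int into one of three buckets
    // using the two range checks from the documentation quoted above.
    static String classify(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";            // outside U+0000..U+10FFFF
        }
        return Character.isBmpCodePoint(codePoint)
                ? "BMP"                  // U+0000..U+FFFF, fits in one char
                : "supplementary";       // U+10000..U+10FFFF, needs two chars
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF, 0x110000};
        for (int cp : samples) {
            System.out.printf("0x%X -> %s%n", cp, classify(cp));
        }
    }
}
```

As a rule of thumb under these assumptions: `isValidCodePoint` answers "is this int in the Unicode code space at all?", while `isBmpCodePoint` answers "does it fit into a single `char`?".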

Wolf
  • It should also be noted that the earliest versions of Unicode were designed as a 16-bit character encoding (i.e. without the extensibility that it has today and limited to ~64k characters). That's why early adopters of Unicode did problematic things like simply replacing `char` with `wchar_t` (i.e. assuming that 16 bits are going to be enough for everyone). And those "lower 16 bits" are what now makes up the BMP, i.e. the part that *everyone* supports, even old Unicode software. Code points beyond that require "modern" software. Unicode 2.0 is when the 16-bit restriction was removed. – Joachim Sauer Sep 13 '21 at 15:05
  • @JoachimSauer Thanks for the comment. When doing some Java 25 years ago, I (from a C/C++ background) learned that a `char` is just a 16-bit entity. If that was true at all: how is this extensibility handled in Java today, UTF-16 maybe? – Wolf Sep 13 '21 at 15:25
  • A `char` in Java simply can't hold all Unicode code points, so it's effectively a UTF-16 code unit. `String` handles the UTF-16 encoding and has methods mentioning `*codePoint*` which do the nasty job of finding and handling UTF-16 surrogates and converting them into proper code points. Basically, as long as you don't use `char` and only `String`, it'll be handled by Java for you (and that's actually not as hard as it sounds). That's also why the two methods above take `int` instead of `char`. – Joachim Sauer Sep 13 '21 at 15:35
  • @JoachimSauer I see, thanks for the clarification. I currently don't use Java, but I'm happy with Unicode in Python and Perl. But in retrospect I am grateful for that lesson (taught by Java) that a `char` is not just a sort of `int` like in C. – Wolf Sep 13 '21 at 15:45
  • Note that `isValidCodePoint()` is a little misleading, as not all numeric values in the range `0x0000..0x10FFFF` are actually valid code points. Many values in that range are either reserved or undefined. `isValidCodePoint()` is fine if all you are concerned about is encoding integers into UTF-16, but for actual text validation, you need to account for code points that Unicode does not deem to be valid for text purposes, and `isValidCodePoint()` won't help you with that. – Remy Lebeau Sep 13 '21 at 17:04
  • The big unspoken truth is that the `char` type in Java is legacy, and basically broken. When working individually character-by-character, use [code point](https://en.wikipedia.org/wiki/Code_point) integer numbers, not `char`. Peruse the API such as `Character`, `String`, and `StringBuilder` to see the various `…codePoint…` methods that have been added over the years. Using `char` in your code is a great way to invite maddening bugs later in production when your app encounters one of the *majority* of Unicode characters that cannot be represented in a `char` value. – Basil Bourque Sep 13 '21 at 17:19