In Java, what's the difference between Character.isBmpCodePoint and Character.isValidCodePoint?

Question

In Java, what's the difference between Character.isBmpCodePoint and Character.isValidCodePoint?

I know the upper limits are 0xFFFF and 0x10FFFF respectively, but what does that imply? Which one should I use?

BMP: Basic Multilingual Plane, i.e. the code points from 0 to 65535 (note: not all code points are valid) – Giacomo Catenazzi, Sep 13 '21 at 14:20
@FredSuvn Thanks for the clarification you added to your question. Now my answer is obviously too short, and I wish to expand it. But this takes a bit of time; there is much to learn about it, as can already be seen in the comments. Also, Wikipedia has a lot of information about the topic. I hope that I'll be able to give an overview of it. – Wolf, Sep 14 '21 at 09:47

Wolf · Answer 1 · 2021-09-13T14:59:51.520

3

The Basic Multilingual Plane (BMP) is a subset of legal code points in Unicode (see Wikipedia).

But let's have a look into the official documentation.

isValidCodePoint

true if the specified code point value is between MIN_CODE_POINT and MAX_CODE_POINT inclusive; false otherwise.

MIN_CODE_POINT: U+0000
MAX_CODE_POINT: U+10FFFF

isBmpCodePoint

true if the specified code point is between MIN_VALUE and MAX_VALUE inclusive; false otherwise.

MIN_VALUE: '\u0000'
MAX_VALUE: '\uFFFF'

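The documentation mixes types in a slightly confusing way here (int constants for one method, char constants for the other), but it's easy to see that the inclusive upper limits differ: 0xFFFF is below 0x10FFFF. So every BMP code point is also a valid code point, but not the other way round.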
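
To make the two ranges concrete, here is a minimal sketch that sorts a few sample values into the buckets described above. The class name `CodePointRanges` and the helper `classify` are made up for illustration; only `Character.isBmpCodePoint` and `Character.isValidCodePoint` come from the JDK.

```java
class CodePointRanges {
    // Hypothetical helper: names the range an int value falls into.
    static String classify(int codePoint) {
        if (Character.isBmpCodePoint(codePoint)) {
            return "BMP";            // U+0000..U+FFFF, fits into a single char
        }
        if (Character.isValidCodePoint(codePoint)) {
            return "supplementary";  // U+10000..U+10FFFF, needs two chars in a String
        }
        return "invalid";            // negative or above U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF, 0x110000, -1};
        for (int cp : samples) {
            System.out.printf("%8X -> %s%n", cp, classify(cp));
        }
    }
}
```

Which one to use then depends on the question being asked: "does this value fit into one `char`?" is `isBmpCodePoint`, while "is this value inside the Unicode code space at all?" is `isValidCodePoint`.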

edited Sep 13 '21 at 14:59

answered Sep 13 '21 at 14:53

Wolf

2

It should also be noted that the earliest versions of Unicode were designed as a 16-bit character encoding (i.e. without the extensibility that it has today and limited to ~64k characters). That's why early adopters of Unicode did problematic things like simply replacing `char` with `wchar_t` (i.e. assuming that 16 bits were going to be enough for everyone). Those "lower 16 bits" are what now makes up the BMP, i.e. the part that *everyone* supports, even old Unicode software. Code points beyond that require "modern" software. Unicode 2.0 is when the 16-bit restriction was removed. – Joachim Sauer Sep 13 '21 at 15:05
@JoachimSauer Thanks for the comment. When doing some Java 25 years ago, I (from a C/C++ background) learned that a `char` is just a 16 bit entity. If that was true at all: how is this extensibility handled in Java today, UTF-16 maybe? – Wolf Sep 13 '21 at 15:25
2

A `char` in Java simply can't hold all Unicode code points, so it's effectively a UTF-16 code unit. `String` handles the UTF-16 encoding and has methods mentioning `*codePoint*` which do the nasty job of finding UTF-16 surrogate pairs and converting them into proper code points. Basically, as long as you don't use `char` and only use `String`, Java handles it for you (and that's actually not as hard as it sounds). That's also why the two methods above take `int` instead of `char`. – Joachim Sauer Sep 13 '21 at 15:35
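
A small sketch of what the comment above describes: one supplementary code point occupies two `char` values in a `String`, and the code point methods put it back together. The class name `SurrogateDemo` and the helper `fromCodePoint` are hypothetical; the calls inside are standard `Character` and `String` methods.

```java
class SurrogateDemo {
    // Hypothetical helper: builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = "a" + fromCodePoint(0x1F600);   // 'a' followed by U+1F600

        System.out.println(s.length());                        // 3 chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));   // 2 code points

        // The supplementary code point is stored as a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(s.charAt(1)));
        System.out.println(Character.isLowSurrogate(s.charAt(2)));

        // codePointAt reads the pair starting at index 1 as a single int.
        System.out.println(Integer.toHexString(s.codePointAt(1)));

        // codePoints() walks the String one code point at a time.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```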
@JoachimSauer I see, thanks for the clarification. I currently don't use Java, but I'm happy with Unicode in Python and Perl. But in retrospect I am grateful for that lesson (taught by Java) that a `char` is not just a sort of `int` like in C. – Wolf Sep 13 '21 at 15:45
2

Note that `isValidCodePoint()` is a little misleading, as not all numeric values in the range `0x0000..0x10FFFF` are actually valid code points. Many values in that range are either reserved or undefined. `isValidCodePoint()` is fine if all you are concerned about is encoding integers into UTF-16, but for actual text validation, you need to account for code points that Unicode does not deem to be valid for text purposes, and `isValidCodePoint()` won't help you with that. – Remy Lebeau Sep 13 '21 at 17:04
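
As a rough illustration of that point, the sketch below combines the range check with `Character.getType` to reject surrogates and unassigned values. The class `TextCodePoints` and the method `isUsableInText` are hypothetical, and the policy itself is only an assumption about what "valid for text" might mean in a given application; the result for unassigned values also depends on the Unicode version the running JDK supports.

```java
class TextCodePoints {
    // Hypothetical policy: inside the code space, not a lone surrogate,
    // and assigned in the Unicode version known to this JDK.
    static boolean isUsableInText(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false;                         // outside U+0000..U+10FFFF
        }
        int type = Character.getType(codePoint);
        return type != Character.SURROGATE        // U+D800..U+DFFF
            && type != Character.UNASSIGNED;      // e.g. the noncharacter U+FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xD800, 0xFFFF, 0x1F600, 0x110000};
        for (int cp : samples) {
            System.out.printf("%6X valid=%b usable=%b%n",
                    cp, Character.isValidCodePoint(cp), isUsableInText(cp));
        }
    }
}
```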
2

The big unspoken truth is that the `char` type in Java is legacy, and basically broken. When working individually character-by-character, use [code point](https://en.wikipedia.org/wiki/Code_point) integer numbers, not `char`. Peruse the API such as `Character`, `String`, and `StringBuilder` to see the various `…codePoint…` methods that have been added over the years. Using `char` in your code is a great way to invite maddening bugs later in production when your app encounters one of the *majority* of Unicode characters that cannot be represented in a `char` value. – Basil Bourque Sep 13 '21 at 17:19
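
To illustrate the advice above, here is a short sketch of processing text by code point rather than by `char`. The class `LettersOnly` and the method `keepLetters` are invented for this example; `String.codePoints`, `Character.isLetter(int)` and `StringBuilder.appendCodePoint` are the JDK methods doing the work.

```java
class LettersOnly {
    // Hypothetical helper: keeps only the letters, one code point at a time.
    static String keepLetters(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints()                   // surrogate pairs arrive as a single int
             .filter(Character::isLetter)    // resolves to the isLetter(int) overload
             .forEach(out::appendCodePoint); // writes one or two chars as needed
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1D400 is a letter outside the BMP.
        String boldA = new String(Character.toChars(0x1D400));
        String result = keepLetters("a1" + boldA + "!");
        System.out.println(result.codePointCount(0, result.length()));
    }
}
```

A loop over `charAt` with `Character.isLetter(char)` would see the two surrogate halves of U+1D400 separately, neither of which is a letter, and would drop the character.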
