20

Without the u flag, the hex range that can be used is [\x{00}-\x{ff}], but with the u flag it goes up to the 4-byte value \x{7fffffff} ([\x{00000000}-\x{7fffffff}]).

So if I execute the below code:

preg_match("/[\x{00000000}-\x{80000000}]+/u", $str, $match);

Will get this error:

Warning: preg_match(): Compilation failed: character value in \x{...} sequence is too large

So I can't match a letter like 𡃁 with the equivalent hex value of f0 a1 83 81. The question is not how to match these letters, but where this range & boundary came from, as the u modifier should treat strings as UTF-16.

PCRE supports UTF-16 since v8.30

echo PCRE_VERSION;

PCRE version with PHP 5.3.24 - 5.3.28, 5.4.14 - 5.5.7:

8.32 2012-11-30

PCRE version with PHP 5.3.19 - 5.3.23, 5.4.9 - 5.4.13:

8.31 2012-07-06

http://3v4l.org/CrPZ8

revo
  • 3
    Have you tried `\x{0210c1}`, the real codepoint? – Ry- Jan 06 '14 at 16:33
  • @minitech What's the point? – revo Jan 06 '14 at 16:35
  • 5
    `u modifier should treat strings as UTF-16` where did you get that? Documentation says only about UTF-8 – dev-null-dweller Jan 06 '14 at 16:35
  • @dev-null-dweller PCRE supports native UTF-16 since v8.30 – revo Jan 06 '14 at 17:04
  • But PHP != PCRE. PCRE supports UTF-16 but it has to be enabled - it is not the default. From http://www.pcre.org/pcre.txt you cannot have UTF-8/16/32 in one compilation, and PHP is compiled against the UTF-8 one. – dev-null-dweller Jan 06 '14 at 17:24
  • @dev-null-dweller Right, and I'm not going to dig deep into versions; however, my question is something else, about that weird range. – revo Jan 06 '14 at 17:40
  • 4
    But it should answer your question. This is not a weird range; UTF-8's last codepoint is U+7FFFFFFF – dev-null-dweller Jan 06 '14 at 17:44
  • 1
    @dev-null-dweller `In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding.` *Am I wrong?!* – revo Jan 06 '14 at 18:16

5 Answers

9

Unicode and UTF-8, UTF-16, UTF-32 encoding

Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.

In Unicode, a character maps to a single code point, but it can have different representation depending on how it is encoded.

I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16, and 000210C1 in UTF-32.

To be precise, the example above shows how to map a code point to code units (character encoding form). How the code units are mapped to octet sequence is another matter. See Unicode encoding model
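
A quick sketch of these encoding forms, assuming the mbstring extension is available (the UTF-8 bytes are the ones given in the question):

```php
<?php
// U+210C1 written out as its UTF-8 bytes, as given in the question.
$utf8 = "\xf0\xa1\x83\x81";

// Re-encode the same code point in the other two Unicode encoding forms.
echo bin2hex($utf8), "\n";                                           // f0a18381
echo bin2hex(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8')), "\n"; // d844dcc1
echo bin2hex(mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8')), "\n"; // 000210c1
```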

PCRE 8-bit, 16-bit and 32-bit library

Since PHP hasn't adopted PCRE2 (version 10.10 and later) yet, the quoted text is from the documentation of the original PCRE.

Support for 16-bit and 32-bit library

PCRE includes support for 16-bit strings from version 8.30 and 32-bit strings from version 8.32, in addition to the default 8-bit library.

As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. [...]

Meaning of 8-bit, 16-bit, 32-bit

8-bit, 16-bit and 32-bit here refers to the data unit (code unit).

References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.

This means that 8-bit/16-bit/32-bit library expects the pattern and the input string to be sequences of 8-bit/16-bit/32-bit data units, or valid UTF-8/UTF-16/UTF-32 strings.

Different APIs for different width of data unit

PCRE provides 3 sets of identical APIs for the 8-bit, 16-bit and 32-bit libraries, differentiated by prefix (pcre_, pcre16_ and pcre32_ respectively).

The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.

In PCRE2, a similar naming convention is used, where the 8-bit/16-bit/32-bit functions have an _8, _16, _32 suffix respectively. Applications which use only one code unit width can define PCRE2_CODE_UNIT_WIDTH to use the generic names of the functions without the suffix.

UTF mode vs. non-UTF mode

When the UTF mode is set (via in-pattern options (*UTF), (*UTF8), (*UTF16), (*UTF32)1 or compile options PCRE_UTF8, PCRE_UTF16, PCRE_UTF32), all sequences of data units are interpreted as sequences of Unicode characters, which consist of all code points from U+0000 to U+10FFFF, except for surrogates and BOM.

1 The in-pattern options (*UTF8), (*UTF16), (*UTF32) are only available in the corresponding library. You can't use (*UTF16) in 8-bit library, nor any mismatched combination, since it simply doesn't make sense. (*UTF) is available in all libraries, and provides a portable way to specify UTF mode in-pattern.

In UTF mode, the pattern (which is a sequence of data units) is interpreted and validated as a sequence of Unicode code points by decoding the sequence as UTF-8/UTF-16/UTF-32 data (depending on the API used), before it is compiled. The input string is also interpreted and optionally validated as a sequence of Unicode code points during the matching process. In this mode, a character class matches one valid Unicode code point.

On the other hand, when the UTF mode is not set (non-UTF mode), all operations work directly on the data unit sequences. In this mode, a character class matches one data unit, and apart from the maximum value that can be stored in a single data unit, there is no restriction on the value of a data unit. This mode can be used for matching structures in binary data. However, do not use this mode when you are dealing with Unicode characters; well, unless you are fine with ASCII and ignoring the rest of the world's languages.
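
A minimal sketch of the difference, using é (U+00E9), which is stored as the two bytes C3 A9 in UTF-8:

```php
<?php
// Same bytes, same pattern, with and without the u flag.
$str = "\xc3\xa9"; // "é" as two UTF-8 bytes

var_dump(preg_match('/^.$/',  $str)); // int(0) - `.` is one byte; there are two
var_dump(preg_match('/^..$/', $str)); // int(1) - two 8-bit data units
var_dump(preg_match('/^.$/u', $str)); // int(1) - one Unicode code point
```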

Constraints on character values

Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:

8-bit non-UTF mode    less than 0x100
8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
16-bit non-UTF mode   less than 0x10000
16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
32-bit non-UTF mode   less than 0x100000000
32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint

Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.
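
A sketch of these limits as seen from PHP's 8-bit PCRE build. Note that the exact behaviour of the out-of-range cases depends on the PCRE version bundled with PHP (the comments below reflect current builds); an invalid pattern makes preg_match() return false (the warning is silenced with @):

```php
<?php
// 8-bit non-UTF mode: \x{...} must stay below 0x100.
var_dump(@preg_match('/[\x{00}-\x{ff}]/', 'a')); // int(1) - in range
var_dump(@preg_match('/\x{100}/', 'a'));         // bool(false) - too large without /u

// 8-bit UTF mode: any value up to 0x10FFFF compiles; beyond is an error.
var_dump(@preg_match('/\x{10FFFF}/u', 'a'));     // int(0) - compiles, just no match
var_dump(@preg_match('/\x{110000}/u', 'a'));     // bool(false) on current builds
```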

PHP and PCRE

The PCRE functions in PHP are implemented by a wrapper which translates PHP-specific flags and calls into PCRE API (as seen in PHP 5.6.10 branch).

The source code calls into the PCRE 8-bit library API (pcre_), so any string passed into a preg_ function is interpreted as a sequence of 8-bit data units (bytes). Therefore, even if the PCRE 16-bit and 32-bit libraries are built, they are not accessible from PHP at all.

As a result, the PCRE functions in PHP expect:

  • ... an array of bytes in non-UTF mode (default), which the library reads in 8-bit "characters" and compiles to match strings of 8-bit "characters".
  • ... an array of bytes which contains a Unicode string UTF-8 encoded, which the library reads in Unicode characters and compiles to match UTF-8 Unicode strings.
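
A short sketch of this byte-oriented view, reusing the question's bytes (mbstring assumed available for mb_strlen()):

```php
<?php
// Four UTF-8 bytes encoding the single character U+210C1.
$str = "\xf0\xa1\x83\x81";

var_dump(strlen($str));                 // int(4) - four bytes
var_dump(mb_strlen($str, 'UTF-8'));     // int(1) - one code point
var_dump(preg_match('/^.{4}$/', $str)); // int(1) - four 8-bit "characters"
var_dump(preg_match('/^.$/u', $str));   // int(1) - one Unicode character
```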

This explains the behavior as seen in the question:

  • In non-UTF mode (without the u flag), the maximum value in a hexadecimal regex escape sequence is FF (as shown by [\x{00}-\x{ff}]).
  • In UTF mode, any value beyond 0x10FFFF (like \x{7fffffff}) in a hexadecimal regex escape sequence is simply nonsense.

Example code

This example code demonstrates:

  • PHP strings are just arrays of bytes and don't understand anything about encoding.
  • The differences between UTF mode and non-UTF mode in PCRE function.
  • The PCRE functions call into the 8-bit library.
// NOTE: Save this file as UTF-8

// Take note of double-quoted string literal, which supports escape sequence and variable expansion
// The code won't work correctly with single-quoted string literal, which has restrictive escape syntax
// Read more at: https://php.net/language.types.string
$str_1 = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81";
$str_2 = "𡃁𡃁";
$str_3 = "\xf0\xa1\x83\x81\x81\x81\x81\x81\x81";

echo ($str_1 === $str_2)."\n";

var_dump($str_3);

// Test 1a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_1, $match);
print_r($match); // Only match 𡃁

// Test 1b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1a)

// Test 1c
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81

// Test 2a
$match = null;
preg_match("/𡃁+/", $str_1, $match);
print_r($match); // Only match 𡃁 (same as 1a)

// Test 2b
$match = null;
preg_match("/𡃁+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1b and 2a)

// Test 2c
$match = null;
preg_match("/𡃁+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81 (same as 1c)

// Test 3a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_1, $match);
print_r($match); // Match two 𡃁

// Test 3b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3a)

// Test 4a
$match = null;
preg_match("/𡃁+/u", $str_1, $match);
print_r($match); // Match two 𡃁 (same as 3a)

// Test 4b
$match = null;
preg_match("/𡃁+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3b and 4a)

Since PHP strings are simply arrays of bytes, as long as the file is saved correctly in some ASCII-compatible encoding, PHP will just happily read the bytes without caring about what encoding they were originally in. The programmer is fully responsible for encoding and decoding the strings correctly.

Due to the above, if you save the file above in UTF-8 encoding, you will see that $str_1 and $str_2 are the same string. $str_1 is decoded from the escape sequences, while $str_2 is read verbatim from the source code. As a result, "/\xf0\xa1\x83\x81+/u" and "/𡃁+/u" are the same string underneath (as are "/\xf0\xa1\x83\x81+/" and "/𡃁+/").

The difference between UTF mode and non-UTF mode is clearly shown in the example above:

  • "/𡃁+/" is seen as the sequence of characters F0 A1 83 81 2B, where a "character" is one byte. Therefore, the resulting regex matches the sequence F0 A1 83 followed by the byte 81 repeated once or more.
  • "/𡃁+/u" is validated and interpreted as the sequence of Unicode characters U+210C1 U+002B. Therefore, the resulting regex matches the code point U+210C1 repeated once or more in a UTF-8 string.

Matching Unicode character

Unless the input contains other binary data, it's strongly recommended to always turn on u mode. The pattern then has access to all facilities to properly match Unicode characters, and both the input and the pattern are validated as valid UTF-8 strings.

Again, using 𡃁 as an example, the example above shows two ways to specify the regex:

"/\xf0\xa1\x83\x81+/u"
"/𡃁+/u"

The first method doesn't work with single-quoted strings -- as the \x escape sequence is not recognized in single quotes, the library will receive the string \xf0\xa1\x83\x81+, which in UTF mode matches U+00F0 U+00A1 U+0083 followed by U+0081 repeated once or more. Apart from that, it's also confusing to the next person reading the code: how are they supposed to know that it's a single Unicode character repeated once or more?
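
A sketch of what the library actually receives from each quoting style:

```php
<?php
// What the PCRE library receives from each quoting style.
$double = "\xf0\xa1\x83\x81"; // 4 bytes - PHP decodes the \x escapes
$single = '\xf0\xa1\x83\x81'; // 16 characters - backslash, x, f, 0, ...

var_dump(strlen($double));     // int(4)
var_dump(strlen($single));     // int(16)
var_dump($double === $single); // bool(false)
```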

The second method works well, and it can even be used with single-quoted strings, but you need to save the file in UTF-8 encoding and take extra care with characters like ÿ, since the same character also exists in single-byte encodings. This method is an option if you want to match a single character or a sequence of characters. However, as the end points of a character range, it may not be clear what you are trying to match. Compare a-z, A-Z, 0-9, א-ת with 一-龥 (which matches most of the CJK Unified Ideographs block (4E00–9FFF), except for the unassigned code points at the end) or 一-十 (which is an incorrect attempt to match the Chinese characters for the numbers 1 to 10).
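
A sketch of both ranges mentioned above (this file must itself be saved as UTF-8; the sample characters are my own picks):

```php
<?php
// 漢 (U+6F22) and 字 (U+5B57) both lie inside 一-龥 (U+4E00-U+9FA5).
var_dump(preg_match('/^[一-龥]+$/u', '漢字')); // int(1)

// 中 (U+4E2D) is not a numeral, yet falls inside 一-十 (U+4E00-U+5341),
// which is why that range is an incorrect way to match the numerals.
var_dump(preg_match('/^[一-十]$/u', '中'));    // int(1) - accidental match
```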

The third method is to specify the code point in hexadecimal escape directly:

"/\x{210C1}/u"
'/\x{210C1}/u'

This works when the file is saved in any ASCII-compatible encoding, works with both single- and double-quoted strings, and also gives a clear code point as the end point of a character range. The disadvantages are that you can't see what the character looks like, and that it is hard to read when specifying a sequence of Unicode characters.
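
A sketch showing that the \x{...} escape reaches PCRE intact from a single-quoted pattern, because PHP's own \x escape does not accept braces:

```php
<?php
// U+210C1 twice, as UTF-8 bytes.
$str = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81";

$m = null;
preg_match('/\x{210C1}+/u', $str, $m); // single-quoted pattern
var_dump(strlen($m[0]));               // int(8) - both characters matched
```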

nhahtdh
  • 1
    Perfect explanations as well as a practical demonstration on treating Unicode characters inside Regular Expressions of PCRE flavor. Bounty is yours for that however I had to accept @Sniffer answer as he was quick in providing details on topics I missed. *+1 the others* – revo May 31 '15 at 12:14
  • Ok, so I now know PHP restricts itself to UTF-8. I know they try to do things like Perl does, so I'm not surprised. It's funny, PCRE allows the UTF-16 construct (Perl doesn't have this) `\u....` I wonder what PHP does with this. UTF-16 is the bastard in the group. I know there exist mal-formed surrogate sequences in the wild. Unfortunately, they can't be detected (fixed or removed) in UTF-8 or UTF-32 modes, only within the 16-bit data width of UTF-16. –  May 31 '15 at 17:51
  • @sln: `Its funny, pcre allows the UTF-16 construct` How is `\u....` allowed? It's only available in JS mode (if you call directly to PCRE library), but it is not available by default. PHP also doesn't turn on the flag `I know there exists mal-formed surrogate sequence's in the wild.` If you encode surrogate code point with UTF-8 or UTF-32, they can be detected as if they are just another Unicode character. Btw, too long encoding are disallowed if you enable UTF mode. – nhahtdh May 31 '15 at 23:40
  • `If you encode surrogate code point with UTF-8 or UTF-32, they can be detected as if they are just another Unicode character` A single surrogate does not have a codepoint without its matching pair. Thus the _hole_. That's the fallacy, it can't be operated on in other than 16-bit data widths (UTF-16) Well formed surrogate pairs can be moved between all UTF modes. That's why the Unicode folks shot themselves in the foot, and why there is a 20-bit max codepoint range. –  Jun 01 '15 at 15:08
  • If it were true that a single, unpaired surrogate had a codepoint, then you would be able to convert it from UTF-8 to UTF-16 (just an example), but that's impossible. –  Jun 01 '15 at 15:16
  • @sln: If you think of surrogate has having code point in the range from D800-DFFF, then you can apply the UTF-8 conversion algorithm to convert them into corresponding representation (in UTF-32, it's similar to UTF-16 - 0000DC00 for example). Of course, a correct implementation will not read or write such sequence, and sequences which decodes to surrogate range shall indicate that it encounters something invalid. – nhahtdh Jun 01 '15 at 17:46
  • @sln: Another thing is surrogates are only meaningful in UTF-16 mode (and yes, it does limit the space to 20 bits). Valid surrogate pairs in UTF-16 is like valid sequence of bytes in UTF-8, which encode a valid code point - since they encode valid code point, they can be correctly decoded back to code point and then encoded in other UTF encodings. – nhahtdh Jun 01 '15 at 17:50
  • After working with UTF-16 for the past 5 months, this peeves me a little. The problem is that the D800-DFFF hole represents a transformation where a codepoint is _generated_ from the first 10 bits of adjacent 16-bit data units. One can't exist without the other. Therefore, a single surrogate cannot be represented as a codepoint, because surrogate pairs represent real codepoints > 0xFFFF, –  Jun 01 '15 at 17:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79350/discussion-between-nhahtdh-and-sln). – nhahtdh Jun 01 '15 at 18:03
  • 1
    @nhahtdh - Thanks for all your info !! I made a Perl test where I injected interpolated surrogates into a UTF-8 string then ran a regex on it. This is artificial and I can't verify if a binary file containing this could be decoded to this form. Regex and string work like you suggest. Code snippet: `while ("surrogate pair \x{D835}\x{DC00} its UTF-8 codepoint \x{01D400}" =~ /([\x{D800}-\x{DFFF}])/g){ printf ("UTF-8 error: Surrogate < 0x%x >\n", ord($1)); }` –  Jun 02 '15 at 17:53
4

So I can't match a letter like 𡃁 with the equivalent hex value of f0 a1 83 81. The question is not how to match these letters, but where this range & boundary came from, as the u modifier should treat strings as UTF-16.

You are mixing two concepts, which is what is causing this confusion.

F0 A1 83 81 isn't the hex value of the character 𡃁. It is how UTF-8 encodes that character's code point in the byte stream.

It is correct that PHP supports Unicode code points in the \x{} pattern, but the value inside { and } represents a Unicode code point, not the actual bytes used to encode the given character in the byte stream.

So the largest possible value you can use with \x{} is actually 10FFFF.

And to match 𡃁 with PHP you need to use its code point, which, as suggested by @minitech in his comment, is \x{0210c1}.
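
A minimal sketch of that suggestion in action:

```php
<?php
// Matching the character by its code point, per the comment.
$str = "\xf0\xa1\x83\x81";                    // the UTF-8 bytes from the question
var_dump(preg_match('/^\x{210C1}$/u', $str)); // int(1)
```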

Further explanation, quoted from the section "Validity of strings" of the PCRE documentation:

The entire string is checked before any other processing takes place. In addition to checking the format of the string, there is a check to ensure that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate area. The so-called "non-character" code points are not excluded because Unicode corrigendum #9 makes it clear that they should not be.

Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode code points with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)
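
A sketch of the surrogate-area exclusion on current PHP builds (older bundled PCRE versions may behave differently):

```php
<?php
// Surrogate code points are rejected in UTF mode.
var_dump(@preg_match('/\x{D7FF}/u', 'x')); // int(0) - valid code point, no match
var_dump(@preg_match('/\x{D800}/u', 'x')); // bool(false) - surrogate, compile error
```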

Ibrahim Najjar
  • I guess the big question is what the PHP regex engine searches for when it sees `𡃁` in the regex. –  May 28 '15 at 18:18
  • One of these two are an error in UTF-16 mode: `\x{210C1}` or `\x{D844}\x{DCC1}` the other one doesn't match in UTF-32 mode. –  May 28 '15 at 18:25
  • @sln For the first question you will have to check the implementation of the regex engine. I'm sorry, I don't fully understand your second question. – Ibrahim Najjar May 28 '15 at 23:10
  • In UTF-16 mode, the character unit size is 16-bits, The source and regex strings are converted into natural unsigned word. In UTF-16, codepoints above U+00FFFF are converted into 2 words, called a surrogate pair, but it _is_ 2 distinct physical character units in the source string. That's why this `\x{210C1}` is not equal to this `\x{D844}\x{DCC1}` Codepoints above U+00FFFF however are represented the same `\x{210C1}` in UTF-8 and UTF-32 modes. –  May 29 '15 at 15:13
  • @sln: PHP uses 8-bit API. Period. Even if the library is compiled with 16-bit API enabled, it is not used at all. – nhahtdh May 31 '15 at 06:02
1

I'm not sure about PHP, but there really is no governor on code points, so it doesn't matter that there are only some 1.1 million valid ones. That is subject to change at any time, but it's not really up to engines to enforce it. There are reserved code points that are holes in the valid range, and there are surrogates in the valid range; the reasons are endless for there to be no restriction other than the word size.

For UTF-32, you can't go over 31 bits because bit 32 is the sign bit:
0x00000000 - 0x7FFFFFFF

Makes sense, since int as a data type is the natural size of 32-bit hardware registers.

For UTF-16, the same limitation applies, masked to 16 bits, leaving 0x0000 - 0xFFFF as the valid range.

Usually, if you use an engine that supports ICU, you should be able to use it; it converts both the source and the regex into UTF-32. Boost.Regex is one such engine.

edit:

Regarding UTF-16

I guess when Unicode outgrew 16 bits, they punched a hole in the 16-bit range for surrogate pairs, but that left only 20 usable bits in total across the pair.

10 bits in each surrogate, with the other 6 used to determine hi or lo. This left the Unicode folks with a limit of 20 bits plus an extra 0xFFFF, rounded to a total of 0x10FFFF codepoints, with unusable holes.

To be able to convert to a different encoding (8/16/32), all the codepoints must actually be convertible. Thus the forever-backward-compatible 20-bit limit is the trap they ran into early, but now must live with.

Regardless, regex engines won't be enforcing this limit anytime soon, probably never. As for surrogates, they are the hole, and a mal-formed literal surrogate can't be converted between modes. That only pertains to a literal encoded character during conversion, not a hex representation of one. For instance, it's easy to search a text in UTF-16 (only) mode for unpaired surrogates, or even paired ones.

But I guess regex engines don't really care about holes or limits; they only care about what mode the subject string is in. No, the engine is not going to say: 'Hey wait, the mode is UTF-16, I'd better convert \x{210C1} to \x{D844}\x{DCC1}. Wait, if I did that, what do I do if it's quantified, \x{210C1}+? Start injecting regex constructs around it? Worse yet, what if it's in a class, [\x{210C1}]? Nah... better limit it to \x{FFFF}.'

Some handy dandy, pseudo-code surrogate conversions I use:

 Definitions:
 ====================
 10-bits
  3FF = 000000  1111111111

 Hi Surrogate
 D800 = 110110  0000000000
 DBFF = 110110  1111111111 

 Lo Surrogate
 DC00 = 110111  0000000000
 DFFF = 110111  1111111111


 Conversions:
 ====================
 UTF-16 Surrogates to UTF-32
 if ( TESTFOR_SURROGATE_PAIR(hi,lo) )
 {
    u32Out = 0x10000 + (  ((hi & 0x3FF) << 10) | (lo & 0x3FF)  );
 }

 UTF-32 to UTF-16 Surrogates
 if ( u32In >= 0x10000)
 {
    u32In -= 0x10000;
    hi = (0xD800 + ((u32In & 0xFFC00) >> 10));
    lo = (0xDC00 + (u32In & 0x3FF));
 }

 Macros:
 ====================
 #define TESTFOR_SURROGATE_HI(hs)   ((hs & 0xFC00) == 0xD800)
 #define TESTFOR_SURROGATE_LO(ls)   ((ls & 0xFC00) == 0xDC00)
 #define TESTFOR_SURROGATE_PAIR(hs,ls) (((hs & 0xFC00) == 0xD800) && ((ls & 0xFC00) == 0xDC00))
 //
 #define PTR_TESTFOR_SURROGATE_HI(ptr)   ((*ptr & 0xFC00) == 0xD800)
 #define PTR_TESTFOR_SURROGATE_LO(ptr)   ((*ptr & 0xFC00) == 0xDC00)
 #define PTR_TESTFOR_SURROGATE_PAIR(ptr) (((*ptr & 0xFC00) == 0xD800) && ((*(ptr+1) & 0xFC00) == 0xDC00))
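
The same conversions can be sketched as runnable PHP, staying with this page's language; the function names here are made up for the sketch:

```php
<?php
// UTF-16 surrogate pair -> code point, per the pseudo-code above.
function surrogatePairToCodePoint($hi, $lo)
{
    // Assumes $hi is in 0xD800-0xDBFF and $lo is in 0xDC00-0xDFFF.
    return 0x10000 + ((($hi & 0x3FF) << 10) | ($lo & 0x3FF));
}

// Code point -> UTF-16 surrogate pair, per the pseudo-code above.
function codePointToSurrogatePair($cp)
{
    // Assumes $cp > 0xFFFF.
    $cp -= 0x10000;
    return array(0xD800 + (($cp >> 10) & 0x3FF), 0xDC00 + ($cp & 0x3FF));
}

printf("%X\n", surrogatePairToCodePoint(0xD844, 0xDCC1)); // 210C1
list($hi, $lo) = codePointToSurrogatePair(0x210C1);
printf("%X %X\n", $hi, $lo);                              // D844 DCC1
```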
1

As minitech suggests in the first comment, you have to use the code point - for this character 𡃁, it's \x{210C1}. That's also the encoded form in UTF-32. F0 A1 83 81 is the UTF-8 encoded sequence (see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1).

There are some versions of PCRE where you can use values up to \x{7FFFFFFF}, but I really don't know what could be matched with them.

To quote http://www.pcre.org/pcre.txt:

In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are used in pairs to encode values greater than 0xffff.

[...]

In UTF-32 mode, the character code is Unicode, in the range 0 to 0x10ffff, with the exception of values in the range 0xd800 to 0xdfff because those are "surrogate" values that are ill-formed in UTF-32.

0x10ffff is the largest value you can use to match a character (that's what I take from this). 0x10ffff is currently also the largest code point defined in the Unicode standard (see What are some of the differences between the UTFs?) - thus every value above it does not make any sense (or I just don't get it)...

Wolfgang Kluge
  • I think in natural UTF-16 mode the maximum code point runs only up to 65535 decimal, or 0xffff hex. The codepoint range is extended to 0x10ffff via the introduction of surrogate pairs (2 UTF-16 values). However, in UTF-16 mode, codepoints above 0xffff can't be represented via `\x{210c1}`; they have to be broken down (represented) as a surrogate pair (i.e. 2 UTF-16 values). I think engines that run UTF-8 will accept e.g. \x{210c1}. The fastest ones use natural UTF-16 with the ability to promote to UTF-32. I think the ICU library is UTF-32. –  May 28 '15 at 17:32
  • @sln you have to distinguish between the codepoints and their representation as bytes. Unicode currently defines codepoints from 0 to 0x10ffff. Now you have to represent this data - and that's defined by UTF-whatever... See "Encoding Forms" in http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=210C1 for an example of how a multibyte character is encoded – Wolfgang Kluge May 28 '15 at 17:37
  • I think Unicode is UTF-16 and uses surrogate pairs to get up to and past 0x10ffff. Whether or not it has a glyph is irrelevant. –  May 28 '15 at 17:39
  • I just don't see engines as the arbiter of codepoint range. I mean, how would they know this? ICU assigns/unassigns planes all the time. There are many holes in there. More likely, it's up to the processor that assembles a memory image. It's conceivable you could process 32-bit binary data with regex if you can get the binary file into a subject string (following string rules, etc.). Even then, you can cast anything to a string pointer. Also, UTF-32 should be called UTF-31; it looks like it only uses unsigned 32-bit ints. –  May 28 '15 at 17:54
  • @sln I fully agree - but at the end there are 0x10ffff code points in unicode, no matter how its encoded and UTF-16 is "only" an encoding. UTF-16 defines the surrogate pairs you mention to extend it. For codepoints > 0xFFFF you need 4 instead of 2 bytes... I presume, the decoding is done before the text even reaches the regex engine (otherwise the engine should have to process the text byte by byte?!).. The same goes for endianess. – Wolfgang Kluge May 28 '15 at 17:59
  • The first step past ASCII 8-bit to extend char sets was UTF-16 and Unicode was born. 0x0 - 0xffff was meant to cover (and still does - most) all languages. It wasn't until later that there was a need for > 0xffff codepoints. Even later UTF-32 then what UTF-8 because of the prevalence of ASCII chars and performance issues. –  May 28 '15 at 18:13
  • Just to mention one thing.. I don't see UTF-16 as an internal _natural_ encoding, going away anytime soon (probably never). And won't be supplanted by UTF-8, since it encompass almost all the language character sets out there. –  May 28 '15 at 18:44
-1

"but want to know about the max hex boundary in a regex":

  • in all UTF modes: 0x10ffff
  • native 8-bit mode: 0xff
  • native 16-bit mode: 0xffff
  • native 32-bit mode: 0x1fffffff

dark100