Unicode and UTF-8, UTF-16, UTF-32 encoding
Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.
In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.
I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16 and 000210C1 in UTF-32.
To be precise, the example above shows how to map a code point to code units (character encoding form). How the code units are mapped to an octet sequence is another matter. See the Unicode encoding model.
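The mapping from a code point to code units can be sketched in a few lines of PHP. This is a minimal illustration, not library code: the helper names utf8_units, utf16_units and hex_units are hypothetical, and the UTF-8 helper only handles the four-byte form (U+10000 to U+10FFFF), which is all the example character needs.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PHP or PCRE.

// UTF-8 code units for a code point in U+10000..U+10FFFF (four-byte form only).
function utf8_units(int $cp): array {
    return [
        0xF0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F),
    ];
}

// UTF-16 code units (a surrogate pair) for a code point in U+10000..U+10FFFF.
function utf16_units(int $cp): array {
    $v = $cp - 0x10000;
    return [0xD800 | ($v >> 10), 0xDC00 | ($v & 0x3FF)];
}

// Format code units as zero-padded uppercase hex, $width digits each.
function hex_units(array $units, int $width): string {
    $parts = [];
    foreach ($units as $unit) {
        $parts[] = sprintf('%0' . $width . 'X', $unit);
    }
    return implode(' ', $parts);
}

$cp = 0x210C1;
echo hex_units(utf8_units($cp), 2), "\n";  // F0 A1 83 81
echo hex_units(utf16_units($cp), 4), "\n"; // D844 DCC1
echo hex_units([$cp], 8), "\n";            // 000210C1 (UTF-32: the code point itself)
```

The same code point yields four 8-bit units, two 16-bit units or one 32-bit unit, which is exactly the distinction the PCRE libraries below are built around.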
PCRE 8-bit, 16-bit and 32-bit library
Since PHP hasn't adopted PCRE2 (version 10.10) yet, the quoted text is from the documentation of the original PCRE.
Support for 16-bit and 32-bit library
PCRE includes support for 16-bit strings from version 8.30 and 32-bit strings from version 8.32, in addition to the default 8-bit library.
As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. [...]
Meaning of 8-bit, 16-bit, 32-bit
8-bit, 16-bit and 32-bit here refer to the width of the data unit (code unit).
References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
This means that the 8-bit/16-bit/32-bit library expects the pattern and the input string to be sequences of 8-bit/16-bit/32-bit data units, or valid UTF-8/UTF-16/UTF-32 strings, respectively.
Different APIs for different widths of data unit
PCRE provides 3 sets of identical APIs for the 8-bit, 16-bit and 32-bit libraries, differentiated by the prefix (pcre_, pcre16_ and pcre32_ respectively).
The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.
In PCRE2, a similar function naming convention is used, where the 8-bit/16-bit/32-bit functions have the _8, _16, _32 suffix respectively. Applications which use only one code unit width can define PCRE2_CODE_UNIT_WIDTH to use the generic names of the functions without the suffix.
UTF mode vs. non-UTF mode
When the UTF mode is set (via the in-pattern options (*UTF), (*UTF8), (*UTF16), (*UTF32) [1] or the compile options PCRE_UTF8, PCRE_UTF16, PCRE_UTF32), all sequences of data units are interpreted as sequences of Unicode characters, which consist of all code points from U+0000 to U+10FFFF, except for surrogates and BOM.
[1] The in-pattern options (*UTF8), (*UTF16), (*UTF32) are only available in the corresponding library. You can't use (*UTF16) in the 8-bit library, nor any other mismatched combination, since it simply doesn't make sense. (*UTF) is available in all libraries, and provides a portable way to specify UTF mode in-pattern.
In UTF mode, the pattern (which is a sequence of data units) is interpreted and validated as a sequence of Unicode code points by decoding the sequence as UTF-8/UTF-16/UTF-32 data (depending on the API used), before it is compiled. The input string is also interpreted and optionally validated as a sequence of Unicode code points during the matching process. In this mode, a character class matches one valid Unicode code point.
On the other hand, when the UTF mode is not set (non-UTF mode), all operations work directly on the data unit sequences. In this mode, a character class matches one data unit, and apart from the maximum value that can be stored in a single data unit, there is no restriction on the value of a data unit. This mode can be used for matching structure in binary data. However, do not use this mode when you are dealing with Unicode characters, unless you are fine with supporting only ASCII and ignoring the rest of the world's languages.
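A quick way to see the difference is to count how many times "." matches against the UTF-8 bytes of a single code point. This is a small sketch using PHP's preg_match_all (which goes through the 8-bit library, as discussed further down); the variable names are only for illustration.

```php
<?php
// U+210C1 encoded as UTF-8: one code point, four bytes.
$subject = "\xf0\xa1\x83\x81";

// Non-UTF mode: "." consumes one data unit (one byte) per match.
$byteCount = preg_match_all('/./s', $subject, $bytes);

// UTF mode (u flag): "." consumes one whole code point per match.
$charCount = preg_match_all('/./su', $subject, $chars);

printf("non-UTF: %d, UTF: %d\n", $byteCount, $charCount); // non-UTF: 4, UTF: 1
```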
Constraints on character values
Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
8-bit non-UTF mode     less than 0x100
8-bit UTF-8 mode       less than 0x10ffff and a valid codepoint
16-bit non-UTF mode    less than 0x10000
16-bit UTF-16 mode     less than 0x10ffff and a valid codepoint
32-bit non-UTF mode    less than 0x100000000
32-bit UTF-32 mode     less than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.
PHP and PCRE
The PCRE functions in PHP are implemented by a wrapper which translates PHP-specific flags and calls into the PCRE API (as seen in the PHP 5.6.10 branch).
The source code calls into the PCRE 8-bit library API (pcre_), so any string passed into a preg_ function is interpreted as a sequence of 8-bit data units (bytes). Therefore, even if the PCRE 16-bit and 32-bit libraries are built, they are not accessible via the API on the PHP side at all.
As a result, the PCRE functions in PHP expect:
- ... an array of bytes in non-UTF mode (default), which the library reads in 8-bit "characters" and compiles to match strings of 8-bit "characters".
- ... an array of bytes which contains a Unicode string UTF-8 encoded, which the library reads in Unicode characters and compiles to match UTF-8 Unicode strings.
This explains the behavior as seen in the question:
- In non-UTF mode (without the u flag), the maximum value in a hexadecimal regex escape sequence is FF (as shown in [\x{00}-\x{ff}]).
- In UTF mode, any value beyond 0x10ffff (like \x{7fffffff}) in a hexadecimal regex escape sequence is simply nonsense.
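These limits can be probed from PHP by checking whether a pattern compiles. The sketch below assumes a PHP build backed by PCRE 8.3x or PCRE2; pattern_compiles is a hypothetical helper, and the @ operator is used only to hide the warning PHP raises when compilation fails.

```php
<?php
// Hypothetical helper: true when the pattern compiles, false otherwise.
// preg_match() returns false (and raises a warning) on a compile error.
function pattern_compiles(string $pattern): bool {
    return @preg_match($pattern, '') !== false;
}

var_dump(pattern_compiles('/[\x{00}-\x{ff}]/')); // 0xff still fits in one byte
var_dump(pattern_compiles('/\x{100}/'));         // too large for an 8-bit data unit
var_dump(pattern_compiles('/\x{100}/u'));        // a valid code point in UTF mode
var_dump(pattern_compiles('/\x{7fffffff}/u'));   // beyond U+10FFFF
```

On such a build, the first and third patterns should compile and the second and fourth should be rejected.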
Example code
This example code demonstrates:
- PHP strings are just arrays of bytes and don't understand anything about encoding.
- The differences between UTF mode and non-UTF mode in PCRE function.
- PCRE function calls into 8-bit library
<?php
// NOTE: Save this file as UTF-8
// Take note of the double-quoted string literals, which support escape sequences and variable expansion
// The code won't work correctly with single-quoted string literals, which have a more restrictive escape syntax
// Read more at: https://php.net/language.types.string
$str_1 = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81";
$str_2 = "𡃁𡃁";
$str_3 = "\xf0\xa1\x83\x81\x81\x81\x81\x81\x81";
echo ($str_1 === $str_2)."\n"; // Prints 1 when the file is saved as UTF-8
var_dump($str_3);
// Test 1a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_1, $match);
print_r($match); // Only match 𡃁
// Test 1b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1a)
// Test 1c
$match = null;
preg_match("/\xf0\xa1\x83\x81+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81
// Test 2a
$match = null;
preg_match("/𡃁+/", $str_1, $match);
print_r($match); // Only match 𡃁 (same as 1a)
// Test 2b
$match = null;
preg_match("/𡃁+/", $str_2, $match);
print_r($match); // Only match 𡃁 (same as 1b and 2a)
// Test 2c
$match = null;
preg_match("/𡃁+/", $str_3, $match);
print_r($match); // Match 𡃁 and the five bytes of 0x81 (same as 1c)
// Test 3a
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_1, $match);
print_r($match); // Match two 𡃁
// Test 3b
$match = null;
preg_match("/\xf0\xa1\x83\x81+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3a)
// Test 4a
$match = null;
preg_match("/𡃁+/u", $str_1, $match);
print_r($match); // Match two 𡃁 (same as 3a)
// Test 4b
$match = null;
preg_match("/𡃁+/u", $str_2, $match);
print_r($match); // Match two 𡃁 (same as 3b and 4a)
Since PHP strings are simply an array of bytes, as long as the file is saved correctly in some ASCII-compatible encoding, PHP will just happily read the bytes without caring about what encoding it was originally in. The programmer is fully responsible for encoding and decoding the strings correctly.
Due to the above reason, if you save the file above in UTF-8 encoding, you will see that $str_1 and $str_2 are the same string. $str_1 is decoded from the escape sequences, while $str_2 is read verbatim from the source code. As a result, "/\xf0\xa1\x83\x81+/u" and "/𡃁+/u" are the same string underneath (also the case for "/\xf0\xa1\x83\x81+/" and "/𡃁+/").
The difference between UTF mode and non-UTF mode is clearly shown in the example above:
"/𡃁+/" is seen as a sequence of characters F0 A1 83 81 2B, where a "character" is one byte. Therefore, the resulting regex matches the sequence F0 A1 83 followed by the byte 81 repeated once or more.
"/𡃁+/u" is validated and interpreted as a sequence of UTF-8 characters U+210C1 U+002B. Therefore, the resulting regex matches the code point U+210C1 repeated once or more in the UTF-8 string.
Matching Unicode character
Unless the input contains other binary data, it's strongly recommended to always turn the u mode on. The pattern then has access to all facilities to properly match Unicode characters, and both the input and the pattern are validated as valid UTF strings.
Again, using 𡃁 as an example, the example above shows two ways to specify the regex:
"/\xf0\xa1\x83\x81+/u"
"/𡃁+/u"
The first method doesn't work with a single-quoted string. Since the \x escape sequence is not recognized inside single quotes, the library will receive the string \xf0\xa1\x83\x81+ verbatim, which, combined with UTF mode, will match U+00F0 U+00A1 U+0083 followed by U+0081 repeated once or more. Apart from that, it's also confusing to the next person reading the code: how are they supposed to know that it's a single Unicode character repeated once or more?
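The single-quote pitfall is easy to reproduce. In this sketch (variable names are illustrative only), the pattern is written with single quotes, so PCRE rather than PHP interprets each \x escape, and in UTF mode each one becomes a separate code point.

```php
<?php
// Single quotes: PHP passes \xf0 etc. through untouched, so PCRE sees the escapes.
$pattern = '/\xf0\xa1\x83\x81+/u';

$intended = "\xf0\xa1\x83\x81";                 // U+210C1 as UTF-8
$actual   = "\xc3\xb0\xc2\xa1\xc2\x83\xc2\x81"; // U+00F0 U+00A1 U+0083 U+0081 as UTF-8

var_dump(preg_match($pattern, $intended)); // int(0): the intended character does not match
var_dump(preg_match($pattern, $actual));   // int(1): four separate code points match
```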
The second method works well and can even be used with a single-quoted string, but you need to save the file in UTF-8 encoding. This matters especially for characters like ÿ, since that character is also valid in single-byte encodings. This method is an option if you want to match a single character or a sequence of characters. However, when the characters are used as end points of a character range, it may not be clear what you are trying to match. Compare a-z, A-Z, 0-9, א-ת, as opposed to 一-龥 (which matches most of the CJK Unified Ideographs block (4E00–9FFF), except for the unassigned code points at the end) or 一-十 (which is an incorrect attempt to match the Chinese characters for the numbers from 1 to 10).
The third method is to specify the code point in hexadecimal escape directly:
"/\x{210C1}/u"
'/\x{210C1}/u'
This works when the file is saved in any ASCII-compatible encoding, works with both single and double-quoted strings, and also gives clear code points in a character range. This method has the disadvantage that the reader can't see what the character looks like, and it is also hard to read when specifying a sequence of Unicode characters.
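As a short sketch of the third method (variable names are illustrative only): PHP's double-quoted strings leave \x{...} alone because \x must be followed directly by hex digits there, so both quoting styles hand the same pattern to PCRE.

```php
<?php
$subject = "\xf0\xa1\x83\x81\xf0\xa1\x83\x81"; // U+210C1 twice, as UTF-8 bytes

// Both literals are the same string; PCRE interprets \x{210C1} itself.
$single = '/\x{210C1}+/u';
$double = "/\x{210C1}+/u";
var_dump($single === $double); // bool(true)

preg_match($single, $subject, $match);
var_dump($match[0] === $subject); // bool(true): both characters are matched

// Range end points given as code points: U+4E00 to U+9FA5.
$cjk = "\xe4\xb8\x80\xe5\x8d\x81"; // U+4E00 U+5341 as UTF-8
var_dump(preg_match('/^[\x{4E00}-\x{9FA5}]+$/u', $cjk)); // int(1)
```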