What is actually meant by 'invalid code unit sequences' in PHP? How do 'invalid code unit sequences' work in htmlspecialchars()? Examples needed

Question

So, I always need answers specific to PHP rather than generalized answers given by considering technologies other than PHP.

I'm using PHP 7.3.3 on my laptop that runs on Windows 10 Home Single Language 64-bit operating system.

I've installed the latest version of XAMPP installer on my laptop which has installed the Apache/2.4.38 (Win64) and PHP 7.3.3

Today I came across the following text in the PHP Manual describing the possible values of the flags parameter:

flags

A bitmask of one or more of the following flags, which specify how to handle quotes, invalid code unit sequences and the used document type. The default is ENT_COMPAT | ENT_HTML401.

From the above text I couldn't understand what is actually meant by invalid code unit sequences, specifically in PHP.

I also couldn't find a definition, explanation or example of invalid code unit sequences anywhere.

Could you please provide a few good examples of invalid code unit sequences?

Also, please explain how invalid code unit sequences are handled by the built-in function htmlspecialchars().

An answer accompanied by a suitable working code example would be much appreciated.

Thanks.

See http://kunststube.net/encoding to learn some basics of character sets and encodings. From there the answer is pretty simple: it’s some combination of bytes which are invalid (“make no sense”) in a given encoding. — deceze, Mar 27 '19 at 19:17
also a good resource: https://nikic.github.io/2012/01/28/htmlspecialchars-improvements-in-PHP-5-4.html — Sindhara, Mar 27 '19 at 19:21
@deceze : The article you provided seems to be very old. It's from the year **2015** which means almost 4 years back from now. That was an era of **PHP5**. If you are not aware of the changes, let me tell you we are in the modern era of **PHP7.3**. In the last four years PHP has been changed tremendously. Huge changes has been made to the **core of PHP** itself. So, I'm bit doubtful about the **viability** of the article you provided to me. Currently, I'm using **PHP 7.3.3** So, rather than providing links to some outdated article I kindly request you to provide good answer to my doubts. — PHPLover, Mar 27 '19 at 19:27
Nothing has fundamentally changed in PHP string handling *nor the basics of encodings.* It’s still all perfectly fine information. — deceze, Mar 27 '19 at 19:29
@PHPNut You would be well-advised to not so easily dismiss anything that deceze says. You'll be hard-pressed to find many more knowledgeable in PHP around here. — Patrick Q, Mar 27 '19 at 19:30
@deceze : I guess creators of PHP have added the **multibyte string handling support** from **PHP7** which was missing in **PHP5**. Also, till **PHP5** they were assuming the ISO-8859-1 as the default character set but now from **PHP7** they assume **UTF-8** character set. I think these major changes they have made to **PHP7** which were missing in **PHP5**. Am I right? — PHPLover, Mar 27 '19 at 19:32
I think some of this `multibyte string handling support` - died with PHP6. — ArtisticPhoenix, Mar 27 '19 at 19:35
You will have to point to specific manual pages that document what you think has significantly changed in these areas. — deceze, Mar 27 '19 at 19:35
And I told you a while ago that the big Unicode plans of PHP didn’t pan out, PHPNut: https://stackoverflow.com/a/53722559/476 — deceze, Mar 27 '19 at 19:38
@PHPNut "now from PHP7 they assume UTF-8 character set" No. The manual page that you linked says "In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 will use UTF-8 as the default.". So as long as your repeated references to "PHP5" (without a subversion) are at least 5.4, then the default would have been UTF-8. — Patrick Q, Mar 27 '19 at 19:40

score 2 · Answer 1 · answered Mar 27 '19 at 20:52

There could be a few reasons why a string might contain invalid code units. To understand why, you first need to understand what a code unit is and how it differs from a code point.

The Unicode standard defines a list of code points; in simple terms, every character you might need has a well-defined ID. A code point is therefore a unique identifier for a particular character in the Unicode standard. The standard defines 1,114,112 code points on 17 planes.

Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and UCS-2, a precursor of UTF-16. Each encoding produces a different sequence of code units to encode a particular code point.
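To make the difference between a code point and its encoded form concrete, here is a small sketch that encodes one code point in three encodings and prints the resulting bytes as hex. The choice of U+20AC (EURO SIGN) and the variable names are mine, and it assumes the mbstring extension is enabled.

```php
<?php
// One code point: U+20AC (EURO SIGN). The "\u{...}" escape (PHP 7+)
// produces the UTF-8 bytes for that code point.
$euro = "\u{20AC}";

// The same code point as bytes in three different encodings.
// Assumption: the mbstring extension is available.
$utf8  = bin2hex($euro);
$utf16 = bin2hex(mb_convert_encoding($euro, 'UTF-16BE', 'UTF-8'));
$utf32 = bin2hex(mb_convert_encoding($euro, 'UTF-32BE', 'UTF-8'));

echo "UTF-8:    $utf8\n";   // e282ac   (three 8-bit code units)
echo "UTF-16BE: $utf16\n";  // 20ac     (one 16-bit code unit)
echo "UTF-32BE: $utf32\n";  // 000020ac (one 32-bit code unit)
```

The code point is the same in all three cases; only the bytes used to store it differ.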

The maximum number you can store in a byte is 255, and the number of code points far exceeds that. This is where the multi-byte encodings mentioned above come in. I recommend reading more about them in your free time, but for the sake of simplicity I will be talking about UTF-8 only from now on.

UTF-8 is a variable-length encoding. This means that to encode the letter A, for example, you only need 1 byte, whereas a character outside the Basic Multilingual Plane (many emoji, for example) needs 4 bytes. In order to know which bytes in a string belong to a multi-byte sequence, UTF-8 uses prefix codes: the first (lead) byte indicates the number of bytes in the sequence, and every following (continuation) byte has the form 10xxxxxx, i.e. lies in the range 0x80 to 0xBF. Together these bytes make up the code unit sequence for that character. An incorrect character will not be decoded if a stream ends mid-sequence. A continuation byte on its own, without a lead byte, is an invalid code unit sequence; it cannot be decoded to a Unicode code point. Take a look at what happens after 7F in the table at https://en.wikipedia.org/wiki/UTF-8#Description. If you compare this to the PHP source code, you can see that a byte in the range 0x80 <= x < 0xC2 found where a new character should start is treated as invalid: 0x80 to 0xBF are continuation bytes, which are only valid after a lead byte, and 0xC0 and 0xC1 are never valid in UTF-8.
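The lead-byte rule described above can be sketched as a small function. This is not PHP's actual implementation, only an illustration of the same byte thresholds; the function name utf8ExpectedLength is hypothetical.

```php
<?php
/**
 * Hypothetical helper: given the byte at a position where a new character
 * should start, return how many bytes the UTF-8 sequence should have,
 * or 0 if that byte can never start a sequence.
 */
function utf8ExpectedLength(int $byte): int
{
    if ($byte < 0x80) { return 1; } // 0xxxxxxx: ASCII, a single byte
    if ($byte < 0xC2) { return 0; } // 0x80-0xBF continuation bytes, 0xC0/0xC1 never valid
    if ($byte < 0xE0) { return 2; } // 110xxxxx: lead byte of a 2-byte sequence
    if ($byte < 0xF0) { return 3; } // 1110xxxx: lead byte of a 3-byte sequence
    if ($byte < 0xF5) { return 4; } // 11110xxx: lead byte of a 4-byte sequence
    return 0;                       // 0xF5-0xFF never appear in UTF-8
}

foreach ([0x41, 0x80, 0xC3, 0xE2, 0xF0] as $byte) {
    printf("0x%02X -> %d\n", $byte, utf8ExpectedLength($byte));
}
```

A return value of 0 for 0x80 is exactly the "continuation byte without a lead byte" case. A full validator would also have to check that the right number of continuation bytes follows and that the decoded value is not overlong or a surrogate.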

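Because of UTF-16, some code points are also invalid when encoded in UTF-8. The code points U+D800 to U+DFFF are reserved for UTF-16 surrogates; on their own they don't represent a Unicode character, so a UTF-8 sequence that decodes to one of them is invalid.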

A string can be malformed for many different reasons, but in each case the result is the same: it contains illegal byte sequences, i.e. invalid code unit sequences.

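Some examples of invalid code unit sequences would be:

"\xED\xA0\x80" - an encoded surrogate (U+D800)
"\x80" - a continuation byte without a lead byte
"\xC2\x79" - a lead byte followed by a byte that is not a continuation byte
"\xC3\xC0" - the same problem again, and so on...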

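As for how htmlspecialchars() deals with such input, here is a minimal sketch. The sample strings and variable names are my own. Flags and charset are passed explicitly so the result does not depend on the defaults of your PHP version.

```php
<?php
// "\x80" is a continuation byte with no lead byte before it.
$bad = "a\x80<b";

// Without ENT_IGNORE or ENT_SUBSTITUTE, one invalid sequence makes
// htmlspecialchars() return an empty string for the whole input.
$default = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// ENT_SUBSTITUTE replaces the invalid sequence with U+FFFD (replacement character).
$substituted = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE silently drops the invalid sequence (the manual discourages this flag).
$ignored = htmlspecialchars($bad, ENT_QUOTES | ENT_IGNORE, 'UTF-8');

var_dump($default);      // empty string
var_dump($substituted);  // "a", then U+FFFD, then "&lt;b"
var_dump($ignored);      // "a&lt;b"

// Other invalid sequences are rejected the same way.
foreach (["\x80", "\xC2\x79", "\xED\xA0\x80"] as $sample) {
    var_dump(htmlspecialchars($sample, ENT_QUOTES, 'UTF-8') === '');
}
```

So the "invalid code unit sequences" part of the flags description refers to ENT_IGNORE and ENT_SUBSTITUTE, which decide what happens when the input is not valid in the given charset.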