There could be a few reasons why the string might contain invalid code units. To understand why, you first need to understand what a code unit is and how it differs from a code point.
The Unicode standard defines a list of code points, which in simple terms means that every character you might need has a well-defined ID. A code point is therefore a unique identifier for a particular character in the Unicode standard. The standard defines a codespace of 1,114,112 code points across 17 planes.
Unicode can be implemented by different character encodings. The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use. The most commonly used encodings are UTF-8, UTF-16, and UCS-2, a precursor of UTF-16. Each encoding uses code units of a different size (8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32), so the same code point is represented by a different sequence of code units in each encoding.
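To make the difference between a code point and its code units concrete, here is a minimal sketch in Python (chosen only for illustration; the byte-level behaviour is the same in any language). The helper name `code_units` is made up for this example.

```python
# Count how many code units each encoding needs for one code point.
def code_units(ch: str) -> dict:
    """Return the number of code units per encoding for a single character."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

for ch in ("A", "\u00e9", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

The code point stays the same in every row; only the number and size of the code units used to store it change.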
The maximum number you can store in a byte is 255, and the number of code points far exceeds that. This is where the multi-byte encodings mentioned above come in. I recommend reading more about them in your free time, but for the sake of simplicity I will talk only about UTF-8 from now on.
UTF-8 is a variable-length encoding. This means that to encode the letter A, for example, you only need 1 byte, as opposed to a character outside the Basic Multilingual Plane (code point U+10000 or above), which uses 4 bytes. In order to know which bytes in a string belong to a multi-byte sequence you need prefix codes. The first byte indicates the number of bytes in the sequence, and all of those bytes together make up the code unit sequence for that character. Because of this, an incorrect character will not be decoded if a stream ends mid-sequence.
A single byte taken from a multi-byte sequence is, on its own, an invalid code unit sequence; it cannot be decoded to a valid Unicode code point. Take a look at what happens after 7F in the table at https://en.wikipedia.org/wiki/UTF-8#Description. If you compare this to the PHP source code you can clearly see that a byte in the range 0x80 <= x < 0xC2 is invalid at the start of a sequence: bytes 0x80 to 0xBF are continuation bytes and are only valid when preceded by a prefix byte, while 0xC0 and 0xC1 are never valid.
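The prefix-code idea can be sketched roughly as follows. This is an illustrative Python sketch, not the PHP implementation; `sequence_length` and `is_valid_utf8` are hypothetical helper names, and the validity check simply relies on Python's strict built-in UTF-8 decoder.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a lead byte,
    or 0 if the byte cannot start a UTF-8 sequence."""
    if lead <= 0x7F:
        return 1   # 0xxxxxxx: single-byte (ASCII) character
    if 0xC2 <= lead <= 0xDF:
        return 2   # 110xxxxx (0xC0 and 0xC1 are excluded: overlong forms)
    if 0xE0 <= lead <= 0xEF:
        return 3   # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4   # 11110xxx (capped at 0xF4, the limit for U+10FFFF)
    return 0       # 0x80-0xC1 and 0xF5-0xFF never start a sequence


def is_valid_utf8(data: bytes) -> bool:
    """Check validity by attempting a strict decode."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "\u20ac".encode("utf-8")      # three bytes: E2 82 AC
print(sequence_length(euro[0]))      # lead byte announces 3 bytes
print(is_valid_utf8(euro))           # complete sequence
print(is_valid_utf8(euro[:2]))       # stream ends mid-sequence
print(is_valid_utf8(euro[1:2]))      # lone continuation byte
```

Only the complete three-byte sequence decodes; the truncated sequence and the lone continuation byte are both rejected.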
Because of UTF-16, some code points are also invalid when encoded on their own. These are called surrogates (U+D800 to U+DFFF); UTF-16 uses them in pairs, and on their own they don't represent a Unicode character.
A string can be malformed for many different reasons, and one of them is that it contains illegal byte sequences, i.e. invalid code units.
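As a rough illustration of the surrogate rule, the Python sketch below shows that a lone surrogate can be neither encoded to nor decoded from UTF-8, while UTF-16 legitimately uses a surrogate pair for a code point above U+FFFF. The variable names are made up for this example.

```python
lone_surrogate = "\ud83d"  # a high surrogate with no partner

# A strict UTF-8 encoder refuses to encode a lone surrogate.
try:
    lone_surrogate.encode("utf-8")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False

# The bytes ED A0 80 would correspond to U+D800, so a strict decoder rejects them.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# In UTF-16 the same surrogate values are valid, but only as a pair.
pair = "\U0001F600".encode("utf-16-be")

print(encoded_ok, decoded_ok)
print(pair.hex())  # d83dde00: high surrogate D83D, low surrogate DE00
```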
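Some examples of invalid code unit sequences would be:
"\xED\xA0\x80" - an encoded surrogate (U+D800)
"\xED\x9F\xC0" - the third byte is not a continuation byte
"\x80" - a continuation byte with no prefix byte before it
"\xC2\x79" - a prefix byte followed by an ASCII byte instead of a continuation byte
"\xC3\xC0" - a prefix byte followed by a byte that is never valid
and so on...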
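If you want to check sequences like these yourself, a small sketch along the following lines should do. It is written in Python for illustration and leans on the built-in strict decoder; `utf8_error` is a hypothetical helper name, and the `\xED\xA0\x80` sample is added here as an example of an encoded surrogate.

```python
def utf8_error(data: bytes):
    """Return the decoder's reason string if data is not valid UTF-8,
    or None if it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


samples = [
    b"\xED\x9F\xC0",
    b"\x80",
    b"\xC2\x79",
    b"\xC3\xC0",
    b"\xED\xA0\x80",  # encoded surrogate U+D800, added for illustration
]

for sample in samples:
    print(sample, "->", utf8_error(sample))

# A well-formed sequence for comparison.
print(utf8_error("\u00e9".encode("utf-8")))
```

In PHP, mb_check_encoding($string, 'UTF-8') performs a comparable validity check.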