How to get an integer from a utf8mb4 character (aka "\u{D83D}\u{DE00}") in PHP

Question

I want to implement the SASLprep algorithm in PHP (I know there is a library; I want to do it myself). One of my unit tests fails because the test vector "\u{D83D}\u{DE00}" (aka 😀) cannot be converted to code points (an array of integers).

echo mb_ord("\u{D83D}\u{DE00}","UTF-32LE");

fails, returning false

iconv("UTF-8","UTF-32LE","\u{D83D}\u{DE00}");

fails

The expected result is 128512

Does this answer your question? [Output UTF-16? A little stuck](https://stackoverflow.com/questions/3506988/output-utf-16-a-little-stuck) — JosefZ, Feb 24 '22 at 10:53
I'll give it a try. I need utf32LE, the answer is about utf16BE. — Richard Burkhardt, Feb 24 '22 at 12:25
It's not the answer, because it doesn't use PHP's own escape syntax for the code points. I used `\u{D83D}\u{DE00}`, whereas they used `\uD83D\uDE00`, which is basically a plain text string. — Richard Burkhardt, Feb 27 '22 at 22:24

JosefZ · Answer 1 · 2022-02-27T13:47:42.430

First, let's analyze how PHP encodes the string:

<?php
$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
     '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' . 
                  strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;

$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
        ' (' . strval(strlen($char)) . ' bytes) ' .
        str_pad(implode(",", mb_str_split($cars, 2) ), 12) .   // WTF-8
        str_pad('->' . strval(mb_ord($char)) . '<-', 10) . 
        ' 0x' . str_pad( dechex( mb_ord($char)), 6) .
        $char . "\t" . IntlChar::charName($char) . PHP_EOL;
}
?>

The output shows that PHP encodes each surrogate code point on its own as a three-byte sequence, the way WTF-8 (Wobbly Transformation Format, 8-bit) encodes an unpaired surrogate:

71249878parsing.php

Current PHP version: 8.1.2; internal_encoding: UTF-8

ař׿€😀���� (7 chars in 18 bytes)

7
      61 (1 bytes) 61          ->97<-     0x61    a     LATIN SMALL LETTER A
    c599 (2 bytes) c5,99       ->345<-    0x159   ř     LATIN SMALL LETTER R WITH CARON
    d7bf (2 bytes) d7,bf       ->1535<-   0x5ff   ׿
  e282ac (3 bytes) e2,82,ac    ->8364<-   0x20ac  €     EURO SIGN
f09f9880 (4 bytes) f0,9f,98,80 ->128512<- 0x1f600 😀    GRINNING FACE
  eda0bd (3 bytes) ed,a0,bd    -><-       0x0     ��
  edb880 (3 bytes) ed,b8,80    -><-       0x0     ��
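To make the last two rows concrete, here is a minimal sketch that decodes the three bytes of the lone high surrogate by hand. It uses only core functions (no mbstring), and the variable names are mine, chosen for illustration:

```php
<?php
// "\u{D83D}" is a lone high surrogate; PHP emits it as three bytes.
$bytes = "\u{D83D}";
echo bin2hex($bytes), PHP_EOL;            // eda0bd

// Undo the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx by hand:
// keep the low 4 bits of the lead byte and the low 6 bits of each trail byte.
$cp = ((ord($bytes[0]) & 0x0F) << 12)
    | ((ord($bytes[1]) & 0x3F) << 6)
    |  (ord($bytes[2]) & 0x3F);
printf("0x%X%s", $cp, PHP_EOL);           // 0xD83D
```

The bytes are a regular three-byte pattern, so the arithmetic works fine. What `mb_ord` rejects is the result: a code point in the surrogate range, which is not valid in UTF-8.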

Now we can write the following functions and combine them to get the desired number:

function CodepointFromWTF8: decode from well-formed WTF-8 to code points;
function CodepointFromSurrogates: decode from potentially ill-formed UTF-16 to code points. The following formula should suffice for a well-formed UTF-16 surrogate pair: codepoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
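As a quick check of that formula, here is a small sketch applied to the pair from the question. The helper name `surrogatePairToCodepoint` is hypothetical, and it assumes the caller has already verified that the pair is well-formed:

```php
<?php
// Hypothetical helper: combines a well-formed UTF-16 surrogate pair.
// Assumes $high is in 0xD800..0xDBFF and $low is in 0xDC00..0xDFFF.
function surrogatePairToCodepoint(int $high, int $low): int
{
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// 0xD83D - 0xD800 = 0x3D, shifted left by 10 gives 0xF400;
// 0xDE00 - 0xDC00 = 0x200; 0x10000 + 0xF400 + 0x200 = 0x1F600.
echo surrogatePairToCodepoint(0xD83D, 0xDE00), PHP_EOL; // 128512
```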

BTW: tested using the sample string "ař\u{05FF}€😀\u{D83D}\u{DE00}", whose characters are as follows (the CodePoint column contains the Unicode code point (U+hhhh) and the WTF-8 bytes; the Description column contains the surrogates in parentheses, where applicable):

Char CodePoint                      Description
---- ---------                      -----------
   a {U+0061, 0x61}                 Latin Small Letter A
   ř {U+0159, 0xC5,0x99}            Latin Small Letter R With Caron
   ׿ {U+05FF, 0xD7,0xBF}            Undefined
   € {U+20AC, 0xE2,0x82,0xAC}       Euro Sign
   😀 {U+1F600, 0xF0,0x9F,0x98,0x80} GRINNING FACE (0xd83d,0xde00)
     {U+D83D, 0xED,0xA0,0xBD}       Non Private Use High Surrogate
     {U+DE00, 0xED,0xB8,0x80}       Low Surrogate

Edit

Here's a full, simplified solution (sorry, I don't follow the Userland Naming Guide and arbitrarily mix snake_case, camelCase and PascalCase):

<?php

function CodepointFromWTF8($ch) {
    $Bytes = str_split($ch);
    switch (strlen($ch)) {
    case 1:
        $retval = ord($Bytes[0]) & 0x7F;
        break;
    case 2:
        // Mask the integer returned by ord(), not the string byte itself.
        $retval = ((ord($Bytes[0]) & 0x1F) << 6) +
                   (ord($Bytes[1]) & 0x3F);
        break;
    case 3:
        $retval = ((ord($Bytes[0]) & 0x0F) << 12) +
                  ((ord($Bytes[1]) & 0x3F) << 6)  +
                   (ord($Bytes[2]) & 0x3F);
        break;
    case 4:
        $retval = ((ord($Bytes[0]) & 0x07) << 18) +
                  ((ord($Bytes[1]) & 0x3F) << 12) +
                  ((ord($Bytes[2]) & 0x3F) << 6)  +
                   (ord($Bytes[3]) & 0x3F);
        break;
    default:
        $retval = 0;
    }
    return $retval;
}

function IsHighSurrogate($cp) {
    return ($cp >= 0xD800 and $cp <= 0xDBFF);
}

function IsLowSurrogate($cp) {
    return ($cp >= 0xDC00 and $cp <= 0xDFFF);
}

function CodepointFromSurrogates($Surrogates) {
    $cps = array();
    $cpsc = count($Surrogates);
    for ( $ii = 0; $ii < $cpsc; $ii+=1 ) {
        if ( ( IsHighSurrogate( $Surrogates[$ii]) ) and
             ( 1 + $ii ) < $cpsc and
             ( IsLowSurrogate( $Surrogates[1+$ii]) ) )
        {
            array_push($cps, 0x10000 + (
                ($Surrogates[$ii] - 0xD800) << 10) + ($Surrogates[$ii+1] - 0xDC00) );
            $ii+=1;
        } else {
            array_push($cps, $Surrogates[$ii] );
        }
    }
    return $cps;
}

$sp = "ař\u{05FF}€😀\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
     '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' . 
                  strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;

$sp_array = mb_str_split($sp, 1, mb_internal_encoding() );
echo strval( sizeof($sp_array)) . PHP_EOL;
$sp_codepoints = array();
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    $charord = mb_ord($char);
//     echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
//         ' (' . strval(strlen($char)) . ' bytes) ' .
//         str_pad(implode(",", mb_str_split($cars, 2) ), 12) .   // WTF-8
//         str_pad('->' . strval($charord) . '<-', 10) . 
//         ' 0x' . str_pad( dechex( $charord), 6) .
//         str_pad( gettype($charord), 8) .
//         $char . "\t" . IntlChar::charName($char) . 
//         PHP_EOL;
    // mb_ord() returns false for the surrogate sequences; decode those by hand.
    if ($charord === false) {
        $charord = CodepointFromWTF8($char);
    }
    array_push($sp_codepoints, $charord);
}
print_r($sp_codepoints);

$sp_codepoints_real = CodepointFromSurrogates($sp_codepoints);
print_r($sp_codepoints_real);
?>

Result: 71249878.php

Current PHP version: 8.1.2; internal_encoding: UTF-8

ař׿€😀���� (7 chars in 18 bytes)

7
Array
(
    [0] => 97
    [1] => 345
    [2] => 1535
    [3] => 8364
    [4] => 128512
    [5] => 55357
    [6] => 56832
)
Array
(
    [0] => 97
    [1] => 345
    [2] => 1535
    [3] => 8364
    [4] => 128512
    [5] => 128512
)
