First, let's analyze how PHP encodes the string:
<?php
// \u{1F600} (GRINNING FACE) is part of the sample: the output below reports 7 chars in 18 bytes.
$sp = "ař\u{05FF}€\u{1F600}\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
    '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' .
    strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;
$sp_array = mb_str_split($sp, 1, mb_internal_encoding());
echo strval(sizeof($sp_array)) . PHP_EOL;
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
        ' (' . strval(strlen($char)) . ' bytes) ' .
        str_pad(implode(",", mb_str_split($cars, 2)), 12) . // WTF-8
        str_pad('->' . strval(mb_ord($char)) . '<-', 10) .
        ' 0x' . str_pad(dechex(mb_ord($char)), 6) .
        $char . "\t" . IntlChar::charName($char) . PHP_EOL;
}
?>
The output shows that surrogate code points are encoded as WTF-8 (Wobbly Transformation Format − 8-bit):
71249878parsing.php
Current PHP version: 8.1.2; internal_encoding: UTF-8
ař€���� (7 chars in 18 bytes)
7
      61 (1 bytes) 61          ->97<-     0x61    a  LATIN SMALL LETTER A
    c599 (2 bytes) c5,99       ->345<-    0x159   ř  LATIN SMALL LETTER R WITH CARON
    d7bf (2 bytes) d7,bf       ->1535<-   0x5ff
  e282ac (3 bytes) e2,82,ac    ->8364<-   0x20ac  €  EURO SIGN
f09f9880 (4 bytes) f0,9f,98,80 ->128512<- 0x1f600    GRINNING FACE
  eda0bd (3 bytes) ed,a0,bd    -><-       0x0     ��
  edb880 (3 bytes) ed,b8,80    -><-       0x0     ��
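To see what the last two rows contain even though mb_ord() returns nothing for them, the three bytes can be decoded by hand. This is only a minimal sketch: decode3 is a hypothetical helper name (not part of the script above), and it simply applies the ordinary three-byte UTF-8 bit layout without rejecting surrogates.

```php
<?php
// Hypothetical helper: decode a 3-byte sequence laid out as
// 1110xxxx 10xxxxxx 10xxxxxx, accepting surrogates (as WTF-8 does).
function decode3(string $bytes): int {
    return ((ord($bytes[0]) & 0x0F) << 12)   // low 4 bits of the lead byte
         | ((ord($bytes[1]) & 0x3F) << 6)    // 6 payload bits of byte 2
         |  (ord($bytes[2]) & 0x3F);         // 6 payload bits of byte 3
}

printf("U+%04X\n", decode3("\xED\xA0\xBD")); // U+D83D (high surrogate)
printf("U+%04X\n", decode3("\xED\xB8\x80")); // U+DE00 (low surrogate)
```

Both results fall in the range U+D800..U+DFFF, which is why a strict UTF-8 decoder refuses them.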
Now we can write the following functions and combine them to get the desired number:
BTW: Tested using the "ař\u{05FF}€\u{1F600}\u{D83D}\u{DE00}" sample string, where the characters are as follows (column CodePoint contains the Unicode code point (U+hhhh) and the WTF-8 bytes; column Description contains the surrogates in parentheses, where applicable):
Char CodePoint                      Description
---- ---------                      -----------
a    {U+0061, 0x61}                 Latin Small Letter A
ř    {U+0159, 0xC5,0x99}            Latin Small Letter R With Caron
     {U+05FF, 0xD7,0xBF}            Undefined
€    {U+20AC, 0xE2,0x82,0xAC}       Euro Sign
     {U+1F600, 0xF0,0x9F,0x98,0x80} GRINNING FACE (0xd83d,0xde00)
     {U+D83D, 0xED,0xA0,0xBD}       Non Private Use High Surrogate
     {U+DE00, 0xED,0xB8,0x80}       Low Surrogate
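The surrogates listed in parentheses for GRINNING FACE relate to U+1F600 through the standard UTF-16 arithmetic. Here is a small sketch of both directions; the function names combineSurrogates and splitToSurrogates are hypothetical and chosen only for this illustration.

```php
<?php
// Hypothetical helper: high/low surrogate pair -> supplementary code point.
function combineSurrogates(int $high, int $low): int {
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// Hypothetical helper: supplementary code point -> [high, low] surrogates.
function splitToSurrogates(int $cp): array {
    $v = $cp - 0x10000;                       // 20-bit offset above the BMP
    return [0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF)];
}

printf("U+%X\n", combineSurrogates(0xD83D, 0xDE00));   // U+1F600
vprintf("0x%x,0x%x\n", splitToSurrogates(0x1F600));    // 0xd83d,0xde00
```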
Edit
Here's the full simplified solution (sorry, I don't follow the Userland Naming Guide; I arbitrarily mix snake_case, camelCase and PascalCase):
<?php
function CodepointFromWTF8($ch) {
    // Decode one WTF-8 sequence (1 to 4 bytes) by masking off the
    // length bits of the lead byte and the 10xxxxxx continuation markers.
    $Bytes = str_split($ch);
    switch (strlen($ch)) {
        case 1:
            $retval = ord($Bytes[0]) & 0x7F;
            break;
        case 2:
            // Mask the result of ord(), not the raw byte string.
            $retval = ((ord($Bytes[0]) & 0x1F) << 6) +
                       (ord($Bytes[1]) & 0x3F);
            break;
        case 3:
            $retval = ((ord($Bytes[0]) & 0x0F) << 12) +
                      ((ord($Bytes[1]) & 0x3F) << 6) +
                       (ord($Bytes[2]) & 0x3F);
            break;
        case 4:
            $retval = ((ord($Bytes[0]) & 0x07) << 18) +
                      ((ord($Bytes[1]) & 0x3F) << 12) +
                      ((ord($Bytes[2]) & 0x3F) << 6) +
                       (ord($Bytes[3]) & 0x3F);
            break;
        default:
            $retval = 0;
    }
    return $retval;
}
function IsHighSurrogate($cp) {
    return ($cp >= 0xD800 && $cp <= 0xDBFF);
}
function IsLowSurrogate($cp) {
    return ($cp >= 0xDC00 && $cp <= 0xDFFF);
}
function CodepointFromSurrogates($Surrogates) {
    $cps = array();
    $cpsc = count($Surrogates);
    for ($ii = 0; $ii < $cpsc; $ii += 1) {
        if (IsHighSurrogate($Surrogates[$ii]) &&
            (1 + $ii) < $cpsc &&
            IsLowSurrogate($Surrogates[1 + $ii]))
        {
            // High surrogate followed by a low one: merge the pair and skip the low half.
            array_push($cps, 0x10000 +
                (($Surrogates[$ii] - 0xD800) << 10) + ($Surrogates[$ii + 1] - 0xDC00));
            $ii += 1;
        } else {
            array_push($cps, $Surrogates[$ii]);
        }
    }
    return $cps;
}
// \u{1F600} (GRINNING FACE) is part of the sample: the result below reports 7 chars in 18 bytes.
$sp = "ař\u{05FF}€\u{1F600}\u{D83D}\u{DE00}";
echo 'Current PHP version: ' . phpversion() .
    '; internal_encoding: ' . mb_internal_encoding() . PHP_EOL . PHP_EOL;
echo $sp . ' (' . strval(mb_strlen($sp)) . ' chars in ' .
    strval(strlen($sp)) . ' bytes)' . PHP_EOL . PHP_EOL;
$sp_array = mb_str_split($sp, 1, mb_internal_encoding());
echo strval(sizeof($sp_array)) . PHP_EOL;
$sp_codepoints = array();
foreach ($sp_array as $char) {
    $cars = bin2hex($char);
    $charord = mb_ord($char);
    // echo str_pad($cars, 8, ' ', STR_PAD_LEFT) .
    //     ' (' . strval(strlen($char)) . ' bytes) ' .
    //     str_pad(implode(",", mb_str_split($cars, 2)), 12) . // WTF-8
    //     str_pad('->' . strval($charord) . '<-', 10) .
    //     ' 0x' . str_pad(dechex($charord), 6) .
    //     str_pad(gettype($charord), 8) .
    //     $char . "\t" . IntlChar::charName($char) .
    //     PHP_EOL;
    // mb_ord() returns false for the surrogate sequences; decode those by hand.
    if ($charord === false) {
        $charord = CodepointFromWTF8($char);
    }
    array_push($sp_codepoints, $charord);
}
print_r($sp_codepoints);
$sp_codepoints_real = CodepointFromSurrogates($sp_codepoints);
print_r($sp_codepoints_real);
?>
Result: 71249878.php
Current PHP version: 8.1.2; internal_encoding: UTF-8
ař€���� (7 chars in 18 bytes)
7
Array
(
[0] => 97
[1] => 345
[2] => 1535
[3] => 8364
[4] => 128512
[5] => 55357
[6] => 56832
)
Array
(
[0] => 97
[1] => 345
[2] => 1535
[3] => 8364
[4] => 128512
[5] => 128512
)
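If a well-formed UTF-8 string is needed afterwards, the combined code points could be encoded again with mb_chr(). This is only a sketch, not part of the solution above: CodepointsToUtf8 is a hypothetical helper name, the input array is copied from the second print_r() result, and any code point that mb_chr() refuses to encode (such as an unpaired surrogate) is simply skipped here.

```php
<?php
// Assumption: the combined code points from the second array above.
$codepoints = [97, 345, 1535, 8364, 128512, 128512];

// Hypothetical helper: encode each code point as UTF-8 and join the pieces.
function CodepointsToUtf8(array $cps): string {
    $out = '';
    foreach ($cps as $cp) {
        $piece = mb_chr($cp, 'UTF-8');
        if ($piece !== false) {   // skip anything mb_chr() cannot encode
            $out .= $piece;
        }
    }
    return $out;
}

$clean = CodepointsToUtf8($codepoints);
echo mb_strlen($clean, 'UTF-8') . ' chars in ' . strlen($clean) . ' bytes' . PHP_EOL; // 6 chars in 16 bytes
var_dump(mb_check_encoding($clean, 'UTF-8'));
```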