First, what I need: I simply want to get a substring STARTING from '<' (the position of the first '<' character).
<?php
mb_internal_encoding("UTF-8");
$s = iconv('UTF-8', 'cp874', "เรา <l>AB");
$lt = mb_strpos($s, "<");
$newString = mb_substr($s, $lt, 99999);
?>
mb_strpos seemed to be the problem, so I tried to "low-level debug" it.
Before someone complains: $s came from a UTF-8 DB read, and if I simply print it, it works; it contains the same characters that you see there. Also, if I "hexprint" it, it matches this one.
But the above code simply doesn't give me the right position.
mb_internal_encoding("UTF-8");
// "s" originally came from a DB
// but the hexprint is EXACTLY the same so...
$s = iconv('UTF-8', 'cp874', "เรา <l>AB");
$lt = mb_strpos($s, "<");
$os = utf8_to_hex($s);
echo "lt=$lt, [$os]<br>";
$newString = mb_substr($s, $lt, 99999);
echo "New string: [".utf8_to_hex($newString)."]";
This is the output:
lt=4, [E0C3D23C6C3E4142]
New string: [3E4142]
How can lt be 4? Shouldn't it be 3? Then, with lt being 4, mb_strpos is "correct" in its own wrongness, but that behavior messes up all my substring calculations.
Is there a better way to do it? It's driving me mad.
Again: I simply need a SUBSTRING of a UTF-8 string UNTIL (not including) the first '<' character (or the opposite, a substring FROM the first '<' until the end).
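For reference, a minimal sketch of the split described above, assuming the input really is valid UTF-8 at the point where it is searched (that is, it has not been converted to another charset first). The helper name split_at_first_lt is made up for illustration. The encoding is passed explicitly to mb_strpos and mb_substr so the result does not depend on mb_internal_encoding().

```php
<?php
// Hypothetical helper (not part of any library): split a UTF-8 string at
// the first '<'. Returns [text before '<', text from '<' to the end].
// Assumption: $s is valid UTF-8.
function split_at_first_lt(string $s): array
{
    // Character offset of the first '<', or false if there is none.
    $lt = mb_strpos($s, '<', 0, 'UTF-8');

    if ($lt === false) {
        // No '<' at all: everything counts as "before", nothing "from".
        return [$s, ''];
    }

    $before = mb_substr($s, 0, $lt, 'UTF-8');    // up to, not including, '<'
    $from   = mb_substr($s, $lt, null, 'UTF-8'); // from '<' to the end

    return [$before, $from];
}

[$before, $from] = split_at_first_lt("เรา <l>AB");
echo "before=[$before], from=[$from]\n";
```

Because '<' is a single-byte ASCII character and never occurs inside a multibyte UTF-8 sequence, plain strpos() and substr() should give the same split on valid UTF-8 input.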
If needed: I grabbed the "utf8_to_hex" function from SO; here it is:
function utf8_to_hex($string) {
    $hex = '';
    for ($i = 0; $i < strlen($string); $i++) {
        $ord = ord($string[$i]);
        if ($ord < 128) {
            $hex .= dechex($ord);
        } else if ($ord < 224) {
            $hex .= substr(dechex($ord), -2);
            $i++;
        } else if ($ord < 240) {
            $hex .= substr(dechex($ord), -2);
            $i++;
            $hex .= substr(dechex(ord($string[$i])), -2);
        } else {
            $hex .= substr(dechex($ord), -2);
            $i++;
            $hex .= substr(dechex(ord($string[$i])), -2);
            $i++;
            $hex .= substr(dechex(ord($string[$i])), -2);
        }
    }
    return strtoupper($hex);
}
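As a cross-check for the dump, here is a minimal sketch built on PHP's built-in bin2hex(), which emits two hex digits for every byte of the string. The wrapper name bytes_to_hex is hypothetical and only there to mirror the uppercase output format used above.

```php
<?php
// Hypothetical helper (not from the original post): dump every byte of a
// string as uppercase hex, two digits per byte, without skipping any.
function bytes_to_hex(string $s): string
{
    return strtoupper(bin2hex($s));
}

echo bytes_to_hex("A <"), "\n"; // prints 41203C
```

Since bin2hex() covers every byte, its output can be longer than what utf8_to_hex() prints for the same string, because utf8_to_hex() advances $i inside its multibyte branches.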