3

In PHP, the built-in functions don't seem to cover special and new symbols. I need ALL of them, including the ones released 3 months ago. I'm looking to turn a string with mixed symbols such as:

δϱж ☎

into

𝕃𝕆𝕃 𝔯𝔬𝔠𝔰 𝓂𝓎 δϱж ☎

(which the browser will render the same)

I see this being done on the fly. We're talking countless symbols here. And who knows how many more in the future.

How are they achieving this? No way they really have a 1000+ key array of every single symbol and its entity?

I've hit all the related questions, no luck so far.

Hanky Panky
John Smith

3 Answers

3

This function will convert every character (current and future) excluding [0-9A-Za-z ] to a numeric entity. The UTF-8 character encoding is assumed:

function html_entity_encode_all($s) {
    $out = '';
    for ($i = 0; isset($s[$i]); $i++) {
        // read UTF-8 bytes and decode to a Unicode codepoint value:
        $x = ord($s[$i]);
        if ($x < 0x80) {
            // single byte codepoints
            $codepoint = $x;
        } else {
            // multibyte codepoints
            if ($x >= 0xC2 && $x <= 0xDF) {
                $codepoint = $x & 0x1F;
                $length = 2;
            } else if ($x >= 0xE0 && $x <= 0xEF) {
                $codepoint = $x & 0x0F;
                $length = 3;
            } else if ($x >= 0xF0 && $x <= 0xF4) {
                $codepoint = $x & 0x07;
                $length = 4;
            } else {
                // invalid byte
                $codepoint = 0xFFFD;
                $length = 1;
            }
            // read continuation bytes of multibyte sequences:
            for ($j = 1; $j < $length; $j++, $i++) {
                if (!isset($s[$i + 1])) {
                    // invalid: string truncated in middle of multibyte sequence
                    $codepoint = 0xFFFD;
                    break;
                }
                $x = ord($s[$i + 1]);
                if (($x & 0xC0) != 0x80) {
                    // invalid: not a continuation byte
                    $codepoint = 0xFFFD;
                    break;
                }
                $codepoint = ($codepoint << 6) | ($x & 0x3F);
            }
            if (($codepoint > 0x10FFFF) ||
                ($codepoint >= 0xD800 && $codepoint <= 0xDFFF) ||
                ($length == 2 && $codepoint < 0x80) ||
                ($length == 3 && $codepoint < 0x800) ||
                ($length == 4 && $codepoint < 0x10000)) {
                // invalid: overlong encoding, UTF-16 surrogate, or out of range
                $codepoint = 0xFFFD;
            }
        }

        // have codepoint, now output:
        if (($codepoint >= 48 && $codepoint <= 57) ||
            ($codepoint >= 65 && $codepoint <= 90) ||
            ($codepoint >= 97 && $codepoint <= 122) ||
            ($codepoint == 32)) {
            // leave plain 0-9, A-Z, a-z, and space unencoded
            $out .= $s[$i];
        } else {
            // all others as numeric entities
            $out .= '&#' . $codepoint . ';';
        }
    }
    return $out;
}

For decoding, the standard function html_entity_decode can be used.
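As a minimal sketch of that decoding step: the sample string below is hand-written to look like what the encoder above would emit, and the `ENT_QUOTES | ENT_HTML5` flags and explicit `'UTF-8'` charset are my own choices rather than anything required.

```php
<?php
// Hand-written sample of numeric entities (not captured from a real run).
$encoded = '&#120131;&#120134;&#120131; &#948;&#1009;&#1078; &#9742;';

// Flags and charset are assumptions for this sketch; adjust to your document type.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// $decoded now holds the raw UTF-8 characters again.
echo $decoded, PHP_EOL;
```

The space is left as-is because it was never encoded in the first place.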

Boann
  • When you run the string through this function, does it output like in the question? I ran this locally and got mostly diamond question marks, but the server shows regular question marks. File encoding is UTF-8. I'm running PHP 5.4.x – John Smith Sep 19 '15 at 03:52
  • 1
    @JohnSmith Yes, the output is identical to that in your question: [https://eval.in/436085](https://eval.in/436085) – Boann Sep 19 '15 at 07:54
2

How are they achieving this? No way they really have a 1000+ key array of every single symbol and its entity?

They do in fact have a translation table and it does contain all the symbols you have in your question (and the table has more than 1500 entries :) ).

Fiddle
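If you want to inspect such a table yourself, something like the following should work. It is only a sketch: I am assuming the HTML5 table (`ENT_HTML5`) is the one meant here, and the exact entry count may differ between PHP versions.

```php
<?php
// HTML_ENTITIES + ENT_HTML5 selects the HTML5 named-entity table.
// The flag combination is an assumption made for this sketch.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Number of character => entity pairs PHP knows about.
echo count($table), " entries\n";

// Look up the named entity for the Greek small letter delta (U+03B4).
echo $table["\u{03B4}"], PHP_EOL;
```

Keys are the raw characters and values are the named entities, so the table only helps for characters that actually have a name.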

Hanky Panky
  • 1
    And _none_ of those are the new characters added. In fact, I think all characters in that list are from Unicode 2.1 (1998), and certainly not those new ones added 3 months ago to Unicode 8.0 – MSalters Sep 18 '15 at 15:23
-2

Simple: the encoding doesn't use any special knowledge. The input is a numerical character value, the output is &#<decimal-value>;.
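A rough sketch of that idea, assuming the mbstring extension and PHP 7.2+ (for `mb_ord`). The function name `encode_non_ascii_as_entities` is made up for illustration, and unlike the longer function in the other answer it leaves all ASCII (including `<` and `&`) untouched:

```php
<?php
// Hypothetical helper: turns every non-ASCII character into &#<decimal>;.
function encode_non_ascii_as_entities(string $s): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // the /u flag makes PCRE read $s as UTF-8
        function (array $m): string {
            // mb_ord() returns the Unicode code point of the matched character.
            return '&#' . mb_ord($m[0], 'UTF-8') . ';';
        },
        $s
    );
    if ($result === null) {
        // preg_replace_callback() returns null on error, e.g. malformed UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo encode_non_ascii_as_entities("\u{03B4}\u{03F1}\u{0436} \u{260E}"), PHP_EOL;
// prints: &#948;&#1009;&#1078; &#9742;
```

No table is involved, so a character added in a later Unicode version is handled the same way as an old one.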

MSalters
  • Look at Hanky Panky's answer. There is a table for the special cases, with 1500 entries. Unicode 8.0 _alone_ [added 7700 characters](http://blog.unicode.org/2015/06/announcing-unicode-standard-version-80.html). In total there are 120,737 Unicode characters, 80x more than that table contains. – MSalters Sep 18 '15 at 15:20