
A related question, "Preventing DOMDocument::loadHTML() from converting entities", did not yield a solution.

This code:

$html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
$doc = new DOMDocument;
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('span') as $node)
{
    var_dump($node->nodeValue);
    var_dump(htmlentities($node->nodeValue));
    var_dump(htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)));
}

Produces this output:

string(16) "🆃🅴🆂🆃"
string(16) "🆃🅴🆂🆃"
string(0) ""

But what I want is &#x1F183;&#x1F174;&#x1F182;&#x1F183;

I am running PHP Version 5.6.29 and ini_get("default_charset") returns UTF-8

ParoX

1 Answer


After reading more at http://php.net/manual/en/function.htmlentities.php, I noticed that htmlentities() does not encode all of Unicode: it only converts characters that have a named HTML entity. Someone posted a superentities function in the comments, but it did not seem to work for me. The UTF8entities function did.
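To illustrate that limitation, here is a small sketch (the variable names are just for illustration). It passes one character that has a named entity and one that does not through htmlentities(), with both inputs written as raw UTF-8 byte escapes so it also runs on PHP 5.6:

```php
<?php
// "é" (U+00E9) has a named entity, so htmlentities() converts it.
$accented = htmlentities("\xC3\xA9", ENT_QUOTES, 'UTF-8');

// U+1F183 has no named entity, so its bytes pass through unchanged.
$squared = htmlentities("\xF0\x9F\x86\x83", ENT_QUOTES, 'UTF-8');

var_dump($accented);
var_dump($squared === "\xF0\x9F\x86\x83");
```

This is why the second var_dump() in the question prints the same 16 bytes as the first.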

Here are two functions I modified from the comment section, along with the code that calls them. The result is not exactly what I wanted (the entities are decimal rather than hexadecimal), but it does give me HTML-encoded values.

$html = "<span>&#x1F183;&#x1F174;&#x1F182;&#x1F183;</span>";
$doc = new DOMDocument;
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('span') as $node)
{
    var_dump(UTF8entities($node->nodeValue));
}


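function UTF8entities($content = "") {
    // Split between every UTF-8 character; the /u flag keeps multi-byte sequences intact
    $characterArray = preg_split('/(?<!^)(?!$)/u', $content, -1, PREG_SPLIT_NO_EMPTY);
    $rv = ''; // initialise the result to avoid an undefined-variable notice
    foreach ($characterArray as $character) {
        $rv .= unicode_entity_replace($character);
    }
    return $rv;
}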

function unicode_entity_replace($c) { // m. perez
    $h = ord($c[0]); // the lead byte determines the sequence length
    if ($h <= 0x7F) {
        return $c; // plain ASCII: leave as-is
    } else if ($h < 0xC2) {
        return $c; // stray continuation byte or overlong lead byte: leave as-is
    }

    if ($h <= 0xDF) {
        // 2-byte sequence
        $h = ($h & 0x1F) << 6 | (ord($c[1]) & 0x3F);
    } else if ($h <= 0xEF) {
        // 3-byte sequence
        $h = ($h & 0x0F) << 12 | (ord($c[1]) & 0x3F) << 6 | (ord($c[2]) & 0x3F);
    } else if ($h <= 0xF4) {
        // 4-byte sequence: only the low 3 bits of the lead byte carry data
        $h = ($h & 0x07) << 18 | (ord($c[1]) & 0x3F) << 12 | (ord($c[2]) & 0x3F) << 6 | (ord($c[3]) & 0x3F);
    } else {
        return $c; // not a valid UTF-8 lead byte
    }
    return "&#" . $h . ";"; // decimal numeric entity
}

Returns this:

string(36) "&#127363;&#127348;&#127362;&#127363;"
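For the hexadecimal form asked for in the question, one possible variation is sketched below. It is not taken from the php.net comments: the function name utf8_hex_entities is hypothetical, and the sketch assumes the mbstring extension is loaded and the input is valid UTF-8. It replaces every non-ASCII character with a hex entity built from its code point:

```php
<?php
// Hypothetical helper (not from the php.net comments); assumes mbstring is available.
function utf8_hex_entities($text) {
    // With the /u flag, the character class matches one whole non-ASCII code point.
    return preg_replace_callback('/[^\x00-\x7F]/u', function ($m) {
        // UCS-4BE encodes the code point as a 4-byte big-endian integer.
        $cp = unpack('N', mb_convert_encoding($m[0], 'UCS-4BE', 'UTF-8'));
        return sprintf('&#x%X;', $cp[1]);
    }, $text);
}

// Same four characters as in the question, written as raw UTF-8 bytes.
$text = "\xF0\x9F\x86\x83\xF0\x9F\x85\xB4\xF0\x9F\x86\x82\xF0\x9F\x86\x83";
var_dump(utf8_hex_entities($text));
```

Calling utf8_hex_entities($node->nodeValue) inside the foreach loop above would be the equivalent of the UTF8entities() call, with ASCII characters left untouched.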

ParoX