PHP DomDocument - Chinese characters inside script tag malformed

Question

I'm trying to parse a simple HTML document containing Chinese characters inside a script tag. However, after processing with PHP's DOMDocument, the characters come out as numeric HTML entities instead of the original text.

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
    </body>
</html>
EOD;

$dom = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom->loadHTML($html);

// Trying different approaches to get correct output
echo $dom->saveHTML();
echo $dom->saveHTML($dom->documentElement);
echo utf8_decode($dom->saveHTML($dom->documentElement));
echo utf8_decode($dom->saveHTML());

Output:

<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "&#35330;&#38321;&#26368;&#26032;&#25351;&#21335;";
        </script>
    </head>
    <body>
    </body>
</html>
<html>
    <head>
        <script>
            const str = "&#35330;&#38321;&#26368;&#26032;&#25351;&#21335;";
        </script>
    </head>
    <body>
    </body>
</html><html>
    <head>
        <script>
            const str = "&#35330;&#38321;&#26368;&#26032;&#25351;&#21335;";
        </script>
    </head>
    <body>
    </body>
</html><!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "&#35330;&#38321;&#26368;&#26032;&#25351;&#21335;";
        </script>
    </head>
    <body>
    </body>
</html>

MaartenDev · Accepted Answer · 2021-03-07T10:39:44.160

2

It seems to work without the mb_convert_encoding call:

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
    </body>
</html>
EOD;

$dom = new DOMDocument();

$dom->loadHTML($html);

echo utf8_decode($dom->saveHTML($dom->documentElement));

result:

<html>
<head><script>
            const str = "訂閱最新指南";
        </script></head>
<body>
    </body>
</html>

With mb_convert_encoding kept in place, decode the entities after saving:

<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
    </body>
</html>
EOD;

$dom = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
$dom->loadHTML($html);

echo html_entity_decode($dom->saveHTML($dom->documentElement));

result:

<html><head><script>
            const str = "訂閱最新指南";
        </script></head><body>
    </body></html>
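
A possible alternative, offered as a sketch and not as part of the original answer: HTML parsers treat the content of a script tag as raw text, so the entities that mb_convert_encoding writes into the script are never decoded back on load. That is why they show up literally in the output. A commonly used workaround is to tell libxml the input is UTF-8 by prepending an XML encoding declaration, which avoids both mb_convert_encoding and html_entity_decode. The `$output` variable name and the use of libxml_use_internal_errors are illustrative choices here; whether this approach covers the other parsing issues mentioned in the comments is an assumption that would need checking.

```php
<?php

$html = <<<EOD
<!DOCTYPE html>
<html>
    <head>
        <script>
            const str = "訂閱最新指南";
        </script>
    </head>
    <body>
    </body>
</html>
EOD;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();

// The prepended declaration hints to libxml that the bytes are UTF-8,
// so no entity conversion is needed before loading.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Saving from the root element returns the markup as UTF-8 text;
// the prepended declaration sits outside <html> and is not included.
$output = $dom->saveHTML($dom->documentElement);

echo $output;
```

Because nothing is entity-encoded on the way in, nothing has to be decoded on the way out, so entities that are meant to stay encoded elsewhere in the page are left alone.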

edited Mar 07 '21 at 10:39

answered Mar 06 '21 at 17:34


Is there any other way that doesn't remove `mb_convert_encoding`? It's required to fix many other issues while parsing. – Gijo Varghese Mar 07 '21 at 04:34
Added an example with the convert encoding @GijoVarghese – MaartenDev Mar 07 '21 at 10:40
`html_entity_decode` did the trick! Thanks! – Gijo Varghese Mar 07 '21 at 13:09
I have one more similar question if you're interested :) https://stackoverflow.com/questions/66541631/gcsesearch-converted-to-search-when-using-php-domdocument – Gijo Varghese Mar 09 '21 at 06:03
Unfortunately `html_entity_decode` also decodes strings inside `code` tags, which shouldn't be decoded. I've created a new post if you would like to check: https://stackoverflow.com/questions/66598623/specify-utf-8-encoding-to-phps-domdocument-without-meta-tag – Gijo Varghese Mar 12 '21 at 10:59
Did the solution provided for https://stackoverflow.com/questions/66541631/gcsesearch-converted-to-search-when-using-php-domdocument work? – MaartenDev Mar 12 '21 at 11:05
I've commented it over there. – Gijo Varghese Mar 12 '21 at 15:28
