How to force XPath to use UTF8?

Question

I have an XHTML document being passed to a PHP app via Greasemonkey AJAX. The PHP app uses UTF8. If I output the POST content straight back to a textarea in the AJAX receiving div, everything is still properly encoded in UTF8.

When I try to parse it using XPath:

$dom = new DOMDocument();
$dom->loadHTML($raw2);
$xpath = new DOMXPath($dom);
$query = '//td/text()';
$nodes = $xpath->query($query);
foreach($nodes as $node) {
  var_dump($node->wholeText);
}

the dumped strings are not UTF-8. How do I force DOM/XPath to use UTF-8?

Can you provide a (tested) example HTML document? – VolkerK Jul 20 '09 at 17:45

Lucia · Answer 1 · 2014-02-15T02:33:43.990

35

I had the same problem and couldn't use Tidy on my web server. I found this solution, and it worked fine:

$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom = new DomDocument();
$dom->loadHTML($html);
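
On PHP 8.2 and later, the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated. A minimal sketch of the same idea using mb_encode_numericentity() might look like this; the sample $html fragment and the conversion map are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical input: a UTF-8 HTML fragment without any charset declaration.
// (This script file itself is assumed to be saved as UTF-8.)
$html = '<table><tr><td>Ä</td><td>ö</td></tr></table>';

// Turn every non-ASCII character into a numeric entity (e.g. &#196;),
// so loadHTML() no longer has to guess the input encoding.
// Map format: [first code point, last code point, offset, mask].
$map  = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$html = mb_encode_numericentity($html, $map, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query('//td/text()') as $node) {
    $texts[] = $node->wholeText; // DOM hands strings back as UTF-8
}
echo implode(' ', $texts), "\n";
```

Because the markup is pure ASCII after the conversion, the result does not depend on which encoding the HTML parser assumes by default.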

edited Feb 15 '14 at 02:33

answered Jul 21 '10 at 22:45

Lucia

+1'd, the only suggestion is to move the second line to the top, it was confusing (at least for me). – Nabil Kadimi Jan 02 '14 at 07:06

I have been struggling on and off with this for over a year. Thank you so much for this. I've tried countless things that didn't work: included special classes, headers, metas, php.ini's, xml utf-8 hacks, and many more and nothing worked for my particular issue, except this. – James Huckabone Jan 19 '14 at 07:47

score 6 · Answer 2 · answered Sep 12 '16 at 14:24

A bit late in the game, but perhaps it helps someone...

The problem might be in the output rather than in the DOM/XPath object itself.

If you output the nodeValue directly, you may get corrupted characters, e.g.:

Ã¬ÂÂÃ¬ÂÂ Ã«Â¹ÂÃ«Â”Â”Ã¬ÂÂ¤
ìì ë¹ë””ì¤ í°ì  íì¤

You have to create the DOM object with the encoding as the second parameter, new \DomDocument('1.0', 'utf-8'), but when you print a value from the DOM node list you still get broken characters:

echo $contentItem->item($index)->nodeValue;

You have to wrap it in utf8_decode():

echo utf8_decode($contentItem->item($index)->nodeValue); // output: 者不終朝而會，愚者可浹旬而學
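
Note that utf8_decode() converts UTF-8 to ISO-8859-1, which can undo the damage when UTF-8 bytes were read as ISO-8859-1 and then encoded as UTF-8 a second time; the function is deprecated as of PHP 8.2. The sketch below simulates that double encoding and reverses it with mb_convert_encoding(). The variable names are made up for illustration, and the mbstring extension is assumed to be available:

```php
<?php
// Sample text; this script file is assumed to be saved as UTF-8.
$original = '者不終朝而會';

// Simulate the corruption: treat each UTF-8 byte as one ISO-8859-1
// character and encode the result as UTF-8 again.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Reverse it: UTF-8 -> ISO-8859-1 restores the original byte sequence,
// which is the same conversion utf8_decode() performs.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');

var_dump($repaired === $original);
```

This only repairs strings that really were double-encoded; applied to correct UTF-8 it would replace characters outside ISO-8859-1.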

Please don't add the same answer to multiple questions. Answer the best one and flag the rest as duplicates. See http://meta.stackexchange.com/questions/104227/is-it-acceptable-to-add-a-duplicate-answer-to-several-questions — Bhargav Rao, Sep 12 '16 at 14:25

score 4 · Accepted Answer · answered Jul 20 '09 at 18:05

If it is a fully fledged, valid XHTML document, you shouldn't use loadHTML() but load()/loadXML().

Given the example xhtml document

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>xhtml test</title>
    </head>
    <body>
        <h1>A Table</h1>
        <table>
            <tr><th>A</th><th>O</th><th>U</th></tr>
            <tr><td>Ä</td><td>Ö</td><td>Ü</td></tr>
            <tr><td>ä</td><td>ö</td><td>ü</td></tr>
        </table>
    </body>
</html>

the script

<?php
$raw2 = 'test.html';

$dom = new DOMDocument();
$dom->load($raw2);
$xpath = new DOMXPath($dom);
var_dump($xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml'));
$query = '//h:td/text()';
$nodes = $xpath->query($query);
foreach($nodes as $node) {
    foo($node->wholeText);
}


// Print every byte of $s as a two-digit hex value so the encoding is visible.
function foo($s) {
    for($i=0; $i<strlen($s); $i++) {
        printf('%02X ', ord($s[$i]));
    }
    echo "\n";
}

prints

bool(true)
C3 84 
C3 96 
C3 9C 
C3 A4 
C3 B6 
C3 BC

i.e. the output strings are UTF-8 encoded.
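
In the question the markup arrives as a string rather than a file, so loadXML() is the likely counterpart to load(). A minimal sketch, assuming $raw2 holds a well-formed XHTML string (the shortened sample document here is made up):

```php
<?php
// Hypothetical XHTML string; this script file is assumed to be saved as UTF-8.
$raw2 = <<<'XHTML'
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body><table><tr><td>Ä</td><td>ü</td></tr></table></body>
</html>
XHTML;

$dom = new DOMDocument();
$dom->loadXML($raw2); // XML parser: honours the declared encoding

$xpath = new DOMXPath($dom);
// XHTML elements live in a namespace, so the query needs a prefix.
$xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');

$texts = [];
foreach ($xpath->query('//h:td/text()') as $node) {
    $texts[] = $node->wholeText;
}
// Show the raw bytes as hex to make the encoding visible.
echo implode(' ', array_map('bin2hex', $texts)), "\n";
```

loadXML() fails on markup that is not well-formed, so this only applies when the input really is valid XHTML.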

The page I'm parsing didn't have . Used Tidy to add that and my problem is solved. — Gordon, Jul 20 '09 at 19:21
That is correct. I maintain the strong opinion (weakly held): if it claims to be XHTML, don't try to fix it; they wanted the x in front, they have to deliver. ;-) — VolkerK, Nov 25 '14 at 10:39

score 1 · Answer 4 · answered Jul 20 '09 at 17:29

I have not tried it, but the second parameter of DOMDocument::__construct seems to be related to the encoding; maybe that'll help you :-)

Otherwise, there is an encoding property in DOMDocument, which is writable.

Since the DOMXPath is constructed with the DOMDocument as its parameter, maybe it'll work...
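
For reference, a small sketch of what the constructor argument and the encoding property do for a document that is built in code. It does not cover loadHTML(), which according to the comments on this answer overrides the setting; the sample element is made up:

```php
<?php
// The encoding is the second constructor argument.
// (This script file is assumed to be saved as UTF-8.)
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->appendChild($dom->createElement('td', 'Ä'));
$utf8Xml = $dom->saveXML();

// The encoding property is writable and changes how the document
// is serialized afterwards.
$dom->encoding = 'ISO-8859-1';
$latin1Xml = $dom->saveXML();

echo $utf8Xml, $latin1Xml;
```

In this sketch the setting controls the XML declaration and the bytes written by saveXML(); the strings handed to and returned from the DOM API stay UTF-8.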

Pascal MARTIN

`$dom->encoding = 'utf8'` had no effect, nor did setting the encoding in `__construct()`. Possibly due to using `loadHTML()`, but I don't know. – Gordon Jul 22 '09 at 15:08

loadHTML() overrides the encoding set in constructor – leticia Nov 21 '12 at 21:46

score 0 · Answer 5 · answered Jun 23 '10 at 00:39

I struggled with a similar problem (unable to force XPath to use UTF-8 in combination with loadHTML()). In the end, this excellent article provided the solution: http://devzone.zend.com/article/8855

Workaround:

Insert an additional section with the appropriate Content-type HTTP-EQUIV meta tag immediately following the opening tag.
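
The linked article's code is not reproduced in this thread, so the following is only a sketch of how such a workaround is commonly written, not a copy of the article: a Content-Type meta tag carrying the charset is prepended to the markup before loadHTML() parses it. The sample $html fragment is an assumption:

```php
<?php
// Hypothetical input: a UTF-8 HTML fragment without a charset declaration.
// (This script file is assumed to be saved as UTF-8.)
$html = '<table><tr><td>Ä</td><td>ü</td></tr></table>';

// Tell the HTML parser which encoding the bytes are in.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument();
$dom->loadHTML($meta . $html);

$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query('//td/text()') as $node) {
    $texts[] = $node->wholeText;
}
echo implode(' ', $texts), "\n";
```

If the document already declares a different charset further down, the parser's behaviour may differ, so the result is worth checking on real input.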

Hans

This link is no longer valid. Can you update it or paste the solution from that page here? – user658182 Aug 15 '17 at 04:24
