2

So I'm trying to parse HTML pages and find the paragraphs (<p>) using getElementsByTagName('p').

The problem is that $element->nodeValue returns weird characters. The document is first fetched into $html with cURL and then loaded into a DOMDocument.
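
A minimal sketch of the pipeline described above, assuming the page is fetched with the cURL extension. `fetch_html` and `paragraph_texts` are hypothetical helper names, and the URL in the usage note is a placeholder. No encoding handling is done here, so non-ASCII text can still come back garbled.

```php
<?php
// Hypothetical helper: fetch a page body with cURL.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $html = curl_exec($ch);
    return $html === false ? '' : $html;
}

// Hypothetical helper: collect the text of every <p> element.
function paragraph_texts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $element) {
        $texts[] = $element->nodeValue;
    }
    return $texts;
}
```

Usage would look like `print_r(paragraph_texts(fetch_html('https://example.com/')));`.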

I'm sure it has to do with charsets.

Here's an example of a response: "aujourd’hui".

Thanks in advance.

Syscall
Elie
  • what is the encoding of the html page in this particular example? – Anurag Jan 08 '10 at 03:01
  • possible duplicate of [PHP DOMDocument loadHTML not encoding UTF-8 correctly](http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly) – cmbuckley Feb 11 '13 at 10:15

4 Answers

7

I had the same issues and now noticed that loadHTML() no longer takes 2 parameters, so I had to find a different solution. Using the following function in my DOM library, I was able to remove the funky characters from my HTML content.

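private static function load_html($html)
{
    $doc = new DOMDocument;

    // The XML declaration makes libxml treat the markup as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Remove the processing-instruction node the declaration leaves behind.
    // Iterate over a copy, because childNodes is a live list.
    foreach (iterator_to_array($doc->childNodes) as $node) {
        if ($node->nodeType == XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }

    $doc->encoding = 'UTF-8';

    return $doc;
}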
stagl
3

None of the above worked for me. I finally found the following:

// Create a DOMDocument instance 
$doc = new DOMDocument();

// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
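
Note that the 'HTML-ENTITIES' pseudo-encoding used above is deprecated as of PHP 8.2. A possible alternative, sketched here under the assumption that the input really is UTF-8, is to turn every non-ASCII character into a numeric entity with mb_encode_numericentity() before parsing. The helper name `load_utf8_html` is hypothetical.

```php
<?php
// Hypothetical helper: load UTF-8 markup without the 'HTML-ENTITIES' pseudo-encoding.
function load_utf8_html(string $content): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric entity such as &#8217;
    // so the parser only ever sees ASCII bytes.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($content, $map, 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    return $doc;
}

$doc = load_utf8_html("<p>aujourd\u{2019}hui</p>");
echo $doc->getElementsByTagName('p')->item(0)->nodeValue, "\n";
```

Unlike htmlentities(), this leaves the markup characters (<, >, &) untouched, so the tags still parse as tags.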

Source and more info

3

I fixed this by forcing conversion to UTF-8 even though the original text was UTF-8:

// SmartDOMDocument is a third-party DOMDocument subclass, not part of PHP itself.
$webpage = iconv("UTF-8", "UTF-8", $webpage);
$dom = new SmartDOMDocument();
$dom->loadHTML($webpage, 'UTF-8');
.
.
echo $node->nodeValue;

PHP is weird :)

Mandar Limaye
1

This is an encoding issue. Try explicitly setting the encoding to UTF-8.

This should help: http://devzone.zend.com/article/8855
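
The linked article is not reproduced here, so the following is only one common way to make the encoding explicit, not necessarily the one it describes. It assumes the markup is UTF-8 and prepends a charset declaration so the parser does not have to guess. The helper name `load_html_with_charset` is hypothetical.

```php
<?php
// Hypothetical helper: declare the charset up front before parsing.
function load_html_with_charset(string $html): DOMDocument
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    $doc->loadHTML($meta . $html);
    return $doc;
}

$doc = load_html_with_charset("<p>aujourd\u{2019}hui</p>");
echo $doc->getElementsByTagName('p')->item(0)->nodeValue, "\n";
```

The extra meta element stays in the parsed document's head, so strip it again if the document is later serialized with saveHTML().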

prodigitalson