PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded

Question

I am experiencing an issue similar to this question:

nodeValue from DomDocument returning weird characters in PHP

The root cause that I have found can be mimicked with mb_convert_encoding().

In my unit tests, this finally caught the issue:

$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test,'UTF-8'),'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test,'UTF-8'),'data is UTF-8');
$this->assertIdentical($test,html_entity_decode('&Atilde;&copy;',ENT_QUOTES,'UTF-8'),'values match');

The raw value of the UTF-8 data appears to be coming over, and the base codepage of the system upon which PHP is running is most likely not UTF-8.
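To make the last assertion above concrete, here is a minimal sketch of what it ends up comparing. It assumes the mbstring extension is loaded, and it writes the character as byte escapes so that the encoding of the script file plays no part. The variable names are illustrative only.

```php
<?php
// "é" (U+00E9) as UTF-8 bytes, independent of how this file is saved.
$eAcute = "\xC3\xA9";

// Treat those two bytes as ISO-8859-1 characters and encode them as UTF-8 again.
$doubleEncoded = mb_convert_encoding($eAcute, 'UTF-8', 'ISO-8859-1');

// The entity form used in the assertion: U+00C3 followed by U+00A9.
$fromEntities = html_entity_decode('&Atilde;&copy;', ENT_QUOTES, 'UTF-8');

echo bin2hex($eAcute), "\n";                // c3a9
echo bin2hex($doubleEncoded), "\n";         // c383c2a9
var_dump($doubleEncoded === $fromEntities); // bool(true)
var_dump($eAcute === $fromEntities);        // bool(false)
```

In other words, the assertion only passes when the value has been encoded to UTF-8 twice, which is the symptom being chased rather than the desired result.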

All the way up until parsing (with an HTML5lib implementation that dumps to DOMDocument) the strings stay clean, UTF-8 friendly. Only at the point of pulling data using

$span->nodeValue

do I see a failure in encoding stability.

My guess is that the htmlentities catch for the domdocument export to nodeValue uses an encoding converter, but disregards the inline encoding value.
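One way to probe that guess is to check whether the parser was ever told the input encoding. The sketch below is an assumption about the setup, not a reproduction of it: it uses DOMDocument::loadHTML() directly rather than HTML5lib, and firstSpanText() is a hypothetical helper. It parses the same UTF-8 bytes once without and once with a charset declaration and dumps the bytes that nodeValue returns.

```php
<?php
// Hypothetical helper: parse an HTML string, return the first <span>'s text.
function firstSpanText(string $html): string
{
    // Collect libxml parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    return $doc->getElementsByTagName('span')->item(0)->nodeValue;
}

$eAcute = "\xC3\xA9"; // "é" as UTF-8 bytes

// No charset information: the parser has to guess the input encoding.
$undeclared = '<html><body><span>' . $eAcute . '</span></body></html>';

// Charset declared in the markup itself.
$declared = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    . '</head><body><span>' . $eAcute . '</span></body></html>';

echo bin2hex(firstSpanText($undeclared)), "\n"; // compare against c3a9
echo bin2hex(firstSpanText($declared)), "\n";   // c3a9
```

If the first line prints something longer than c3a9, the bytes were read as a single-byte encoding on the way in. That would place the problem at parse time rather than in nodeValue itself.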

Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.

UPDATE

In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM, and it exports cleanly escaped HTML which I can then parse back into the correct UTF-8 entities.

Also, one function I did not try was

utf8_decode

So that may be a solution for anyone else experiencing this issue. It solved a related issue I was experiencing with AJAX/PHP, solution found on this blog post from 2009: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)

Welcome to SO! Some additional questions. what do you mean by "raw value", can you show some examples? What encoding is your script file in that contains the `é`? Can you show the correct, and the failing value(s)? What output encoding are you using on your page? — Pekka, Mar 03 '11 at 20:34
By "raw value" I mean that the value eventually rendering is the result of this function call `html_entity_decode('Ã©',ENT_QUOTES,'UTF-8')` So, essentially in the source html there is a span that contains a word with the character é, when I extract the contents of that span using `$span->nodeValue` where `$span` is the result of a DOMDocument `getElementsByTagName()`. I'm trying to use UTF-8 everywhere, meta is set to UTF-8, as per this html: ` ` — Dave Espionage, Mar 03 '11 at 22:03
So, the html page displays `é` and the result of `nodeValue` is the rendered equivalent of `Ã©` which, from what I've read, is the equivalent of what happens when `mb_check_encoding('é','UTF-8')` is run on a system without a default encoding of UTF-8 — Dave Espionage, Mar 03 '11 at 22:18

score 3 · Answer 1 · edited May 03 '12 at 20:58

I just used utf8_decode on a nodeValue and it indeed kinda worked; I had the same problem with special characters not displaying correctly.

However, some characters still remain problematic, such as the simple quote ' and a few others (œ for example)

So using $element->nodeValue will not work, but utf8_decode($element->nodeValue) will - PARTLY.
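A likely reason for the "partly" is that utf8_decode targets ISO-8859-1, which has no slot for œ or for the typographic apostrophe. This is a rough sketch, assuming the "simple quote" above was the curly ’ (U+2019) and that mbstring is available. It uses mb_convert_encoding() with an explicit source encoding, which performs the same UTF-8 to ISO-8859-1 conversion without relying on utf8_decode itself.

```php
<?php
$eAcute = "\xC3\xA9";     // "é" (U+00E9) in UTF-8
$oe     = "\xC5\x93";     // "œ" (U+0153) in UTF-8
$quote  = "\xE2\x80\x99"; // "’" (U+2019) in UTF-8

// UTF-8 -> ISO-8859-1: characters outside Latin-1 become "?" (0x3f).
echo bin2hex(mb_convert_encoding($eAcute, 'ISO-8859-1', 'UTF-8')), "\n"; // e9
echo bin2hex(mb_convert_encoding($oe, 'ISO-8859-1', 'UTF-8')), "\n";     // 3f
echo bin2hex(mb_convert_encoding($quote, 'ISO-8859-1', 'UTF-8')), "\n";  // 3f

// UTF-8 -> Windows-1252: this code page does have both characters.
echo bin2hex(mb_convert_encoding($oe, 'Windows-1252', 'UTF-8')), "\n";    // 9c
echo bin2hex(mb_convert_encoding($quote, 'Windows-1252', 'UTF-8')), "\n"; // 92
```

So if the page really has to be served in a single-byte encoding, Windows-1252 loses fewer characters than ISO-8859-1. Keeping everything in UTF-8 avoids the loss altogether.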

edited May 03 '12 at 20:58

Nate


answered May 03 '12 at 09:44

Patrick


Ah yes, in this case I was working with French accents, so that's where it became a major issue, all "standard" alphabet entities were fine, but anything that ventured into UTF-8 only territory became improperly converted. I'm wondering if there's a server setting involved somewhere? – Dave Espionage Oct 29 '12 at 15:24

score 1 · Answer 2 · answered May 03 '12 at 09:57

The functions utf8_decode and utf8_encode are not very well named. They literally convert from utf-8 to iso-8859-1 and from iso-8859-1 to utf-8 respectively.

When called with only utf-8 as the target encoding, mb_convert_encoding takes the internal encoding as the source, so it will normally behave much like utf8_encode. ("Normally" meaning unless you changed the internal encoding, which you probably - hopefully - didn't.)
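Here is a small sketch of that dependence on the internal encoding, assuming mbstring is loaded. It sets the internal encoding explicitly both ways, so the result does not depend on the server defaults. As far as I know, recent PHP versions default to UTF-8 where older ones defaulted to ISO-8859-1, so check mb_internal_encoding() on your own system.

```php
<?php
$eAcute   = "\xC3\xA9"; // "é" as UTF-8 bytes
$previous = mb_internal_encoding();

// Source encoding omitted: the bytes are read as ISO-8859-1.
mb_internal_encoding('ISO-8859-1');
$withLatin1 = mb_convert_encoding($eAcute, 'UTF-8');

// Source encoding omitted: the bytes are read as UTF-8, so nothing changes.
mb_internal_encoding('UTF-8');
$withUtf8 = mb_convert_encoding($eAcute, 'UTF-8');

// Restore whatever the script started with.
mb_internal_encoding($previous);

echo bin2hex($withLatin1), "\n"; // c383c2a9
echo bin2hex($withUtf8), "\n";   // c3a9
```

Passing the source encoding as the third argument removes the dependence on this setting.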

Many of PHP's string functions work on raw bytes, and several of the encoding-aware ones (htmlentities and html_entity_decode before PHP 5.4, for example) default to iso-8859-1. However, libxml (the library underlying PHP's XML extensions) expects strings to be utf-8. As such, you can easily end up with mangled encodings if you aren't cautious.

As for your test, the first line may be deceptive. Since you have a literal é in your script, the test would change depending on which encoding you have saved the file in. Check your text editor for that.
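A quick way to take the editor out of the equation is to spell the bytes out with escapes instead of typing the character. A minimal sketch, assuming mbstring is loaded; the variable names are just for illustration:

```php
<?php
$utf8   = "\xC3\xA9"; // "é" as a UTF-8 editor would save it
$latin1 = "\xE9";     // "é" as an ISO-8859-1 / Windows-1252 editor would save it

var_dump(strlen($utf8));                       // int(2)
var_dump(strlen($latin1));                     // int(1)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

Running bin2hex() on the literal in the original test would show which of the two byte sequences the file actually contains.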

Hope that clarifies a bit.

I learned a whooooole lot about what those functions do when I was working on this originally :) I did not change the internal code page (saw the warnings about that one.) Also worth noting, the test code you see in the question is probably the fifth permutation. I tried several different ways of saving the files (ensuring UTF-8, windows native) and triggering that character (hex, ascii, html entity), what I posted was the last-ditch attempt. Makes me want to test that code again! Thanks for the thoughts. — Dave Espionage, Jun 04 '12 at 15:48
