
I am using PHP's SimpleXML to process an XML file, and get this error:

Message: simplexml_load_string(): Entity: line 9: parser error : EntityRef: expecting ';'

A quick Google search reveals that this is generally caused by an un-escaped & - there's a dozen questions with that answer here on Stack Overflow. However, here's line 9 of the file:

<p>In-kingdom commentary on the following items can be found on the November LoP. https://oscar.sca.org/kingdom/kingloi.php?kingdom=9&amp;loi=4191</p>

As you can see, the & is escaped. A text search on the file reveals no other instances of &.

What am I missing?

Please note: I have no ability to edit the XML file - I must take it as it comes and only fix things in my code. I currently open the XML with the following code:

    $rawstring = file_get_contents($filename);
    $safestring = html_entity_decode($rawstring, 0, 'ISO-8859-1');
    $xmlstring = simplexml_load_string($safestring);

(the html_entity_decode is necessary as the file uses Latin-1 encoding and simplexml expects UTF-8)

Help appreciated.

jgalak

1 Answer


html_entity_decode() is not meant for character-set conversion, and it is exactly what is causing your problem. As the name suggests, it decodes HTML entities such as &amp; into the characters they represent; &amp; becomes &. After that call, the escaped ampersand in your URL is a bare & again, which is what the XML parser is complaining about.
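To see the effect in isolation, here is a minimal sketch. The one-line `$raw` document and the example.org URL are made up for illustration and stand in for the real file; only `html_entity_decode()` and `simplexml_load_string()` come from the question.

```php
<?php
// Hypothetical one-line document standing in for the real file.
$raw = '<p>https://example.org/page.php?a=9&amp;b=4191</p>';

// html_entity_decode() turns the escaped "&amp;" back into a bare "&".
$decoded = html_entity_decode($raw, ENT_QUOTES, 'ISO-8859-1');

// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Fails: "&b=4191" looks like an entity reference with no closing ";".
$fromDecoded = simplexml_load_string($decoded);

// Parses: the ampersand is still escaped in the untouched string.
$fromRaw = simplexml_load_string($raw);

var_dump($fromDecoded === false);
var_dump((string) $fromRaw);
```

The first parse should fail with the same "EntityRef" complaint as in the question, while the untouched string parses and yields the URL with a plain & in it.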

If you want to convert the character encoding of the original $rawstring to ISO-8859-1 or UTF-8 you should use something like iconv() or mb_convert_encoding().

Here's an example that should work:

    $rawstring = file_get_contents($filename);
    $safestring = mb_convert_encoding($rawstring, 'ISO-8859-1' /*, $optionalOriginalEncoding */);
    $xmlstring = simplexml_load_string($safestring);

See the list of supported encodings, as well.


However, since the original $rawstring is already Latin-1, converting it to ISO-8859-1 is pointless: Latin-1 and ISO-8859-1 are the same encoding. You may need to convert to UTF-8 instead, but I'm fairly certain even that isn't necessary.
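As a rough sketch of both options, assuming the file carries an XML declaration that names its encoding (the question does not show the prolog, so that is a guess), and assuming the mbstring extension is available. The `$latin1` sample document is invented for illustration.

```php
<?php
// Hypothetical Latin-1 document; "\xC6" and "\xF6" are the ISO-8859-1 bytes for Æ and ö.
$latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n<p>\xC6gir &amp; Bj\xF6rn</p>";

// Option A: no conversion at all; the parser honors the encoding declared in the prolog.
$a = simplexml_load_string($latin1);

// Option B: convert the bytes to UTF-8 and update the declaration so it matches the bytes.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$utf8 = preg_replace('/encoding="ISO-8859-1"/', 'encoding="UTF-8"', $utf8, 1);
$b = simplexml_load_string($utf8);

// SimpleXML hands text back as UTF-8 in both cases.
var_dump((string) $a === "\u{C6}gir & Bj\u{F6}rn");
var_dump((string) $a === (string) $b);
```

Either way, the escaped &amp; is left alone until the parser decodes it, and the strings you read back are UTF-8, so whatever displays them needs to treat them as UTF-8.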

Decent Dabbler
  • The problem caused by html_entity_decode() certainly makes sense - hadn't thought of that. Unfortunately, mb_convert_encoding doesn't work right either. To convert from Latin-1 (ISO-8859-1) to UTF-8 I used the line: $safestring = mb_convert_encoding($rawstring, 'UTF-8', 'ISO-8859-1'); and it did not process special characters correctly. For example, 'Æ' and 'ö' in the original both became 'Ã' in the output. This XML set uses a lot of foreign characters, and preserving them is important. – jgalak Jun 06 '17 at 19:33
  • How are you viewing the output? If you are viewing the output in a browser, make sure the correct HTTP `Content-Type` header is set, for instance as: `Content-Type: text/xml; charset=utf-8`. Have a look at [this question](https://stackoverflow.com/q/3272534) for more options. – Decent Dabbler Jun 06 '17 at 20:20
  • So here's the weird thing. If I use this code: `$rawstring = file_get_contents($filename); $safestring = mb_convert_encoding($rawstring, 'UTF-8', 'ISO-8859-1'); $xmlstring = simplexml_load_string($safestring);` and then do `echo $xmlstring->asXML();` everything looks good. But then I go through that string. I have some code that goes through the XML to pull out the relevant items. Specifically I have the following nested items (skipping irrelevant stuff): – jgalak Jun 09 '17 at 13:51
  • `foreach ($xmlstring->xpath('//item') as $item) {` `$sectionxml = simplexml_load_string($item->discussion->{'name-discussion'}->asXML());` `$namediscussion = '';` `foreach($sectionxml->xpath('//p') as $p) {` `$namediscussion = $namediscussion . strip_tags($p->asXML()) . ''; }}}` and then `echo $namediscussion`, the foreign characters are garbled. – jgalak Jun 09 '17 at 13:53
  • This is despite having content-type set. – jgalak Jun 09 '17 at 13:55