I use SimpleXML to process an XML file that contains Cyrillic characters. I also use dom_import_simplexml, importNode and appendChild to copy subtrees from file to file and from place to place. At the end of processing, print_r of the resulting SimpleXMLElement looks fine. But when I call asXml('outputfile.xml'), something strange happens: all Cyrillic characters that are not wrapped in CDATA (some element bodies and all attribute values) are replaced with numeric character references to their Unicode code points.

For example, the output of print_r (just a fragment):

SimpleXMLElement Object ( [@attributes] => Array 
             ( [NAME] => Государственный аппарат и     механизм 
               [COSTYES] => 3.89983579639 [COSTNO] => 0 
               [ID] => 9 )
           [COMMENTYES] => Вы совершенно         правы. 
          [COMMENTNO] => Нет, Вы ошиблись. ) ) )

But in the file that asXml generates, I get something like this:

<QUEST NAME="&#x422;&#x435;&#x43E;&#x440;&#x438;&#x44F;&#x434;&#x432;&#x443;&#x445;&#x43C;&#x435;&#x447;&#x435;&#x439;"
    style="educ" ID="1">
  <DESC><![CDATA[Теория происхождения государства, известная как теория "двух мечей" [2, с.40], 
    представляет из себя...
  ]]></DESC>

I set a UTF-8 locale everywhere possible and googled every combination of the words "simplexml, unicode, cyrillic, asXml", etc., but nothing worked.

UPD: It looks like one of the functions involved applies htmlentities()-style encoding. So, thanks to Voitcus, the solution is to run html_entity_decode() on the output, as advised in the manual comment linked in the comments below.
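A minimal sketch of that workaround, for illustration only. The helper name decodeNonAsciiRefs() is hypothetical (it is not part of SimpleXML), and restricting the decoding to code points above ASCII is my own assumption, made so that references such as &#38; are left alone and the markup stays well-formed. Calling html_entity_decode() on the whole string would also decode &amp; and &lt;.

```php
<?php
// Hypothetical helper: decode numeric character references above ASCII
// in an already serialized XML string.
function decodeNonAsciiRefs(string $xml): string
{
    return preg_replace_callback(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/',
        function (array $m): string {
            $ref = $m[1];
            $codePoint = ($ref[0] === 'x') ? hexdec(substr($ref, 1)) : (int) $ref;
            // Keep ASCII references (e.g. &#10;, &#38;) as they are.
            if ($codePoint < 0x80) {
                return $m[0];
            }
            // Decode a single numeric reference into its UTF-8 bytes.
            return html_entity_decode($m[0], ENT_NOQUOTES | ENT_XML1, 'UTF-8');
        },
        $xml
    );
}

// Example input without an XML declaration, as in the question's scenario.
$element = simplexml_load_string('<QUEST NAME="Теория"><DESC>a &amp; b</DESC></QUEST>');
$decoded = decodeNonAsciiRefs($element->asXML());
echo $decoded;
```

The decoded string can then be written out with file_put_contents() instead of passing a filename to asXml().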

ba3a
  • Please read the discussion (comments) in the manual: http://www.php.net/manual/en/simplexmlelement.asxml.php – Voitcus Aug 01 '13 at 12:38
  • For example [this one](http://www.php.net/manual/en/simplexmlelement.asxml.php#107137) – Voitcus Aug 01 '13 at 12:38
  • Thanks, that worked. I wonder now how I didn't come to this solution on my own. – ba3a Aug 01 '13 at 13:12

1 Answer

I wonder whether the encoding was left undeclared when you first loaded the XML document. The following two snippets give different output.

// 1) No XML declaration, so no encoding is declared.
$simplexml = simplexml_load_string('<QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());

// 2) Same element, but with an XML declaration that names UTF-8.
$simplexml = simplexml_load_string('<?xml version="1.0" encoding="UTF-8"?><QUEST NAME="Государственный" />');
if (!$simplexml) { exit('parse failed'); }
print_r($simplexml->asXml());

A SimpleXMLElement object takes its encoding from the original XML declaration. If no encoding was declared, it generates numeric character references, for safety, I guess.
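If the input cannot be given a declaration, one possible alternative is to set the encoding on the underlying DOM document before serializing. This is only a sketch: it assumes that setting DOMDocument's encoding property to 'UTF-8' makes the serializer write the characters directly, and the variable names are illustrative.

```php
<?php
// Sketch: force UTF-8 serialization when the input has no XML declaration.
$element = simplexml_load_string('<QUEST NAME="Государственный" />');
if ($element === false) {
    exit('parse failed');
}

// dom_import_simplexml() returns the DOMElement for the same node;
// its ownerDocument is the underlying DOMDocument.
$document = dom_import_simplexml($element)->ownerDocument;

// Assumption: declaring the output encoding makes the serializer write
// UTF-8 bytes instead of numeric character references.
$document->encoding = 'UTF-8';

$xml = $document->saveXML();
echo $xml;
```

The same approach should apply after importNode/appendChild, since those calls already operate on DOM nodes.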

akky
  • The encoding was declared in the XML file, and the same encoding was set as the locale. The problem was `htmlentities`; see the comments on my post and the update in it. – ba3a Aug 02 '13 at 08:54
  • Please run my sample. The second one does not encode them even without CDATA. You may need to write the shortest code that reproduces your problem, including the importNode/appendChild part. – akky Aug 02 '13 at 11:38