
I'm using DOMDocument and SimpleXMLElement to create a formatted XML file. While this all works, the resulting file is saved as ASCII, not as UTF-8. I can't find an answer as to how to change that.

The XML is created like so:

    $XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    $rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
    $rootNode->addAttribute('xmlns', $XMLNS);

    $url = $rootNode->addChild('url');
    $url->addChild('loc', "Somewhere over the rainbow");

    // Producing an indented file requires a DOMDocument.
    $dom = dom_import_simplexml($rootNode)->ownerDocument;
    $dom->formatOutput = true;

    $path = "C:\\temp";

    // This saves an ASCII file
    $dom->save($path.'/sitemap.xml');

The resulting XML looks like this (which is as it should be I think):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>Somewhere over the rainbow</loc>
  </url>
</urlset>

Unfortunately the file is ASCII encoded and not UTF-8.

How do I fix this?

Edit: Don't use Notepad++ to check the encoding

I've got it to work now thanks to the accepted answer below. One note: I used Notepad++ to open the file and check the encoding. However, when I regenerated the file, Notepad++ would reload its tab and for some reason indicate ANSI as the encoding. Closing and reopening the same file in Notepad++ would then indicate UTF-8 again. This caused me a load of confusion.
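Since an editor's guess can mislead, the encoding can also be checked from PHP itself. The following is a minimal sketch, assuming the mbstring extension is available; `describeEncoding()` is a hypothetical helper, not something from the original code. It also illustrates why a sitemap containing only ASCII characters may be reported as ASCII or ANSI: such a file is byte-for-byte identical whether it is interpreted as ASCII or as UTF-8.

```php
<?php
// Hypothetical helper (an assumption for illustration): classifies the raw
// bytes of a string as invalid UTF-8, pure ASCII, or multibyte UTF-8.
function describeEncoding(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        return 'not valid UTF-8';
    }
    // No byte above 0x7F: the data is identical in ASCII and in UTF-8.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 0) {
        return 'pure ASCII (also valid UTF-8)';
    }
    return 'UTF-8 with multibyte characters';
}

echo describeEncoding("<loc>Somewhere over the rainbow</loc>"), PHP_EOL;
echo describeEncoding("<loc>\u{00DC}ber dem Regenbogen</loc>"), PHP_EOL;
echo describeEncoding("\xE9"), PHP_EOL; // a lone ISO-8859-1 byte
```

To check the generated file, pass it `file_get_contents()` of the saved sitemap.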

Coo

2 Answers


I think there are a couple of things going on here. For one, you need:

$dom->encoding = 'utf-8';
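To illustrate what that property changes, here is a small sketch that is independent of the sitemap code. The variable names and the sample text are assumptions chosen for the demonstration. As far as I know, a document without a declared encoding has its non-ASCII characters serialized as numeric character references, while a document whose encoding is UTF-8 writes them as raw UTF-8 bytes and names the encoding in the XML declaration.

```php
<?php
// Two documents holding the same non-ASCII text; only the encoding differs.
$plain = new DOMDocument('1.0'); // no encoding declared
$plain->appendChild($plain->createElement('loc', "caf\u{00E9}"));

$utf8 = new DOMDocument('1.0');
$utf8->encoding = 'UTF-8'; // the property discussed above
$utf8->appendChild($utf8->createElement('loc', "caf\u{00E9}"));

$plainXml = $plain->saveXML();
$utf8Xml  = $utf8->saveXML();

echo $plainXml; // expect a character reference in place of the e-acute
echo $utf8Xml;  // expect encoding="UTF-8" and the raw two-byte character
```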

But also, I think we should try creating the DOMDocument manually specifying the proper encoding. So:

<?php

$XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
$rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
$rootNode->addAttribute('xmlns', $XMLNS);

$url = $rootNode->addChild('url');
$url->addChild('loc', "Somewhere over the rainbow");

// Producing an indented file requires a DOMDocument.
// Import the root element rather than its ownerDocument:
// importNode() cannot import a document node.
$domSxe = dom_import_simplexml($rootNode);

// Create the target document with UTF-8 as its encoding.
$dom = new DOMDocument('1.0', 'UTF-8');
$domSxe = $dom->importNode($domSxe, true);
$domSxe = $dom->appendChild($domSxe);

$path = "C:\\temp";

$dom->formatOutput = true;
$dom->save($path.'/sitemap.xml');

Also ensure that any elements or CData you're adding are actually UTF-8 (see utf8_encode()).

Using the example above, this works for me:

php > var_dump($utf8);
string(11) "ᙀȾᎵ⁸"

php > $XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
php > $rootNode = new \SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
php > $rootNode->addAttribute('xmlns', $XMLNS);
php > $url = $rootNode->addChild('url');

php > $url->addChild('loc', "Somewhere over the rainbow $utf8");

php > $domSxe = dom_import_simplexml($rootNode);
php > $domSxe->encoding = 'UTF-8';
php > $dom = new DOMDocument('1.0', 'UTF-8');
php > $domSxe = $dom->importNode($domSxe, true);
php > $domSxe = $dom->appendChild($domSxe);
php > $dom->save('./sitemap.xml');


$ cat ./sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>Somewhere over the rainbow ᙀȾᎵ⁸</loc></url></urlset>
Will
  • Your code changes the XML declaration in the output to `<?xml version="1.0" encoding="utf-8"?>` (notice that `utf-8` is now lowercase), but it doesn't change the actual encoding of the file, which is still detected as ANSI. – Coo Dec 30 '15 at 03:21
  • Lowercase is fine. But, did you include the `$dom->encoding = 'utf-8';` and the `$dom->saveXML()`? And are you sure the elements you're adding are in UTF-8? Can you show a more realistic example of adding some UTF-8 data to the DOM tree? – Will Dec 30 '15 at 03:25
  • I did include the `$dom->encoding = 'utf-8';`, but I missed the `saveXML()`. Though `saveXML()` gives me an error: `Error Type: 4096 Message: Argument 1 passed to DOMDocument::saveXML() must be an instance of DOMNode, string given` (which is to be expected, as `saveXML()` dumps the XML to a string according to the docs). Edit: let me play with some examples to give you. – Coo Dec 30 '15 at 03:30
  • Ok I updated my example a bit trying a different method. This seems to work in my environment. But if you could show an example of adding a UTF-8 string to the document that would help. – Will Dec 30 '15 at 03:34
  • And you're right about the `saveXML()`, `save()` will work properly. – Will Dec 30 '15 at 03:44
  • I got it to work with your code! Thanks! Something else played a part: I used Notepad++ to check the encoding, but there seems to be a bug in it. When Notepad++ automatically reloads the file after detecting a change, it says the encoding is ANSI. Close the file and reopen it, and it indicates UTF-8 again. That took me some fiddling. – Coo Dec 30 '15 at 03:56
  • Ah, weird, yeah editor encoding can definitely be a factor. Btw, it's "ASCII" not "ANSI" unless we're talking about something else :) Glad we got it figured out, thanks! – Will Dec 30 '15 at 08:45
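On the `saveXML()` error mentioned in the comments: `save()` takes a filename and writes the document to disk, whereas `saveXML()` takes an optional `DOMNode` and returns the XML as a string, so passing it a path fails. A minimal sketch of the two calls follows; the temporary file is an assumption made only to keep the example self-contained.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->formatOutput = true;
$dom->appendChild($dom->createElement('urlset'));

// Temporary file, used here only for illustration.
$file = tempnam(sys_get_temp_dir(), 'sitemap');

// save() writes to the given path and returns the number of bytes written,
// or false on failure.
$bytes = $dom->save($file);

// saveXML() returns the serialized document as a string; writing that string
// yourself is an alternative way to produce the file.
$xml = $dom->saveXML();
file_put_contents($file, $xml);

echo $xml;
unlink($file); // clean up the illustration file
```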

Your data is probably not UTF-8 encoded. You can convert it like so:

utf8_encode($yourData);

Or, maybe:

iconv('ISO-8859-1', 'UTF-8', $yourData)
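One caveat: `utf8_encode()` is deprecated as of PHP 8.2, and it converts its input unconditionally, so data that is already UTF-8 gets double-encoded. A hedged alternative, assuming the mbstring extension is available and that any input which is not valid UTF-8 is ISO-8859-1 (an assumption you would need to confirm for your own data), is to convert only when necessary. `toUtf8()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns $value as UTF-8. Anything that is not already
// valid UTF-8 is ASSUMED to be ISO-8859-1; change the source encoding to
// match your actual data.
function toUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";          // "café" encoded as ISO-8859-1
$utf8   = toUtf8($latin1);

echo bin2hex($utf8), PHP_EOL; // 636166c3a9
```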
Clay