I'm a bit confused about UTF-8 in XML and JSON output

I have the following array:

$array = array(
    array('name'=>'abc', 'text'=>'اسلسصثصض صثصهخه عه☆anton & budi☆' ),
    array('name'=>'xyz', 'text'=>'nice' ),
);

When I convert it to XML, I get this result:

<?xml version="1.0"?>
<response>
  <item>
    <name>abc</name>
    <text>&#x627;&#x633;&#x644;&#x633;&#x635;&#x62B;&#x635;&#x636; &#x635;&#x62B;&#x635;&#x647;&#x62E;&#x647; &#x639;&#x647;&#x2606;anton '&lt;&amp;&gt;' budi&#x2606;</text>
  </item>
  <item>
    <name>xyz</name>
    <text>nice</text>
  </item>
</response>

Why is the result not like the following?

<?xml version="1.0"?>
<response>
  <item>
    <name>abc</name>
    <text>اسلسصثصض صثصهخه عه☆anton &amp; budi☆</text>
  </item>
  <item>
    <name>xyz</name>
    <text>nice</text>
  </item>
</response>

And when I convert it to JSON, I get this result:

[
  {
    "name": "abc",
    "text": "\u0627\u0633\u0644\u0633\u0635\u062b\u0635\u0636 \u0635\u062b\u0635\u0647\u062e\u0647 \u0639\u0647\u2606anton '<&>' budi\u2606"
  },
  {
    "name": "xyz",
    "text": "nice"
  }
]

Why is it not like this?

[
  {
    "name": "abc",
    "text": "اسلسصثصض صثصهخه عه☆anton &amp; budi☆"
  },
  {
    "name": "xyz",
    "text": "nice"
  }
]

Is there any way to use raw UTF-8 characters inside XML or JSON, or is this escaping the standard behavior?

Kevin Ji
Ahmad
  • Have you saved the file as UTF-8? http://www.w3schools.com/xml/xml_encoding.asp – Sam Nov 21 '11 at 05:43
  • This looks like an issue with your conversion tools. The "why is it not like this" JSON sample is valid JSON. – Raymond Hettinger Nov 21 '11 at 05:43
  • It's fine and valid XML/JSON. Your library just escapes the characters as defined by the XML/JSON standard. – deceze Nov 21 '11 at 05:46
  • @Sam88: How do I output XML with UTF-8 encoding in PHP? – Ahmad Nov 21 '11 at 05:46
  • @Ahmad see the following http://stackoverflow.com/questions/217089/looking-for-a-utf-8-text-editor – Sam Nov 21 '11 at 05:51
  • Try this: http://stackoverflow.com/questions/869650/getting-simplexmlelement-to-include-the-encoding-in-output – Dinuka Thilanga Nov 21 '11 at 06:00

1 Answer

It's probably for the sake of diagnostics and a better likelihood of being transported correctly - systems are generally pretty good at transporting ASCII, but many systems aren't written well when it comes to other encodings.

It should, of course, be possible to transport the UTF-8 encoded form correctly, but I suspect the encoder you're using is simply being conservative. It means you don't need to make sure you get it right at the HTTP level, for example. The main thing is that it will still give the right text overall. Is this causing you some actual problem, or were you just surprised by the use of escaping?
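
To make the "still gives the right text" point concrete, here is a minimal sketch. It assumes PHP 5.4 or later and the built-in `json_encode` and `DOMDocument`; the question doesn't say which encoder was actually used, so that part is a guess, and `buildXml` is a hypothetical helper written only for this illustration. It shows that the escaped and unescaped forms decode to the same string, and how one might ask for raw UTF-8 output instead.

```php
<?php
// Sample value from the question's array (the ampersand is a literal "&").
$text = 'اسلسصثصض صثصهخه عه☆anton & budi☆';

// JSON: by default non-ASCII characters are written as \uXXXX escapes;
// JSON_UNESCAPED_UNICODE (PHP 5.4+) keeps them as raw UTF-8 bytes.
$escapedJson = json_encode(array('text' => $text));
$rawJson     = json_encode(array('text' => $text), JSON_UNESCAPED_UNICODE);

// Both forms decode to the identical string.
$fromEscaped = json_decode($escapedJson, true);
$fromRaw     = json_decode($rawJson, true);
var_dump($fromEscaped['text'] === $text && $fromRaw['text'] === $text); // bool(true)

// Hypothetical helper: builds <response><text>...</text></response>.
// With an empty encoding, libxml writes non-ASCII characters as numeric
// character references; with 'UTF-8' it writes them as raw UTF-8.
function buildXml($text, $encoding = '')
{
    $doc  = new DOMDocument('1.0', $encoding);
    $root = $doc->createElement('response');
    $doc->appendChild($root);
    $node = $doc->createElement('text');
    // createTextNode escapes "&", "<" and ">" on output in either case.
    $node->appendChild($doc->createTextNode($text));
    $root->appendChild($node);
    return $doc->saveXML();
}

$escapedXml = buildXml($text);          // numeric character references
$rawXml     = buildXml($text, 'UTF-8'); // raw UTF-8, encoding declared

// Parsing the escaped document gives back the original text.
$parsed = new DOMDocument();
$parsed->loadXML($escapedXml);
var_dump($parsed->getElementsByTagName('text')->item(0)->textContent === $text); // bool(true)
```

Note that in the XML case the `&` still has to appear as `&amp;` whichever encoding is chosen; only the non-ASCII characters change form. If raw UTF-8 is sent, the transport (for example the HTTP `Content-Type` charset) has to declare it correctly, which is exactly what the escaped form lets you avoid.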

Jon Skeet
  • It should only be necessary to use `htmlspecialchars()` for the XML (the big five). `XMLWriter()` should be sufficient - someone correct me if I'm wrong. – Yzmir Ramirez Nov 21 '11 at 05:59
  • @YzmirRamirez: I'm not sure what you mean - but I'm suggesting that even though the escaping here is unnecessary, it shouldn't be *harmful* beyond making the transport a little bigger. – Jon Skeet Nov 21 '11 at 06:11
  • Correct, it shouldn't be necessary and it doesn't seem harmful. I wonder whether the encoder the OP is using has an option to write `<![CDATA[...]]>` sections, where you wouldn't need to escape the content to the level it is now. – Yzmir Ramirez Nov 21 '11 at 06:45
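
The `htmlspecialchars()` suggestion in the comments above can be sketched roughly as follows. This assumes PHP 5.4 or later (for the `ENT_XML1` flag) and UTF-8 input; `xmlEscape` is a hypothetical wrapper name, not something from the original post. Only the five XML special characters are replaced, and non-ASCII text passes through unchanged.

```php
<?php
// Hypothetical wrapper around the built-in htmlspecialchars().
// ENT_XML1 selects XML 1.0 entity names; ENT_QUOTES also escapes
// single quotes, so all five XML special characters are covered.
function xmlEscape($value)
{
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Text similar to the value that appears in the question's output.
$text    = "عه☆anton '<&>' budi☆";
$escaped = xmlEscape($text);

echo '<text>' . $escaped . '</text>', "\n";
// <text>عه☆anton &apos;&lt;&amp;&gt;&apos; budi☆</text>
```

When the markup is assembled by hand like this, the document should be served or saved as UTF-8 and carry an `encoding="UTF-8"` declaration so that consumers interpret the raw characters correctly.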