2

What I have so far is wrapping the text in CDATA sections, and dealing with the possibility of a CDATA terminator (`]]>`) appearing in the text by splitting the text across multiple adjacent CDATA sections.
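For illustration, the splitting approach described above might look like the following. This is only a minimal sketch in Python (the question itself targets Perl), and `to_cdata` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two adjacent sections."""
    # A literal ']]>' would end the section early, so close the section
    # right after ']]' and reopen a new one just before '>'.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


sample = "a ]]> b\nline two <tag> & more"
doc = "<value>" + to_cdata(sample) + "</value>"

# The parser concatenates adjacent CDATA sections into one text value.
parsed = ET.fromstring(doc).text
print(parsed == sample)
```

The adjacent sections are invisible to the consumer: the parser hands back a single string.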

I'm not sure about this, but XML parsers can fail to preserve newlines inside of CDATA tags, correct? This would mean escaping them somehow as well...

I want to generate these XML files using Perl, and parse them with C++ (using expat), Java, and C#.

Most importantly, I want the resulting files to be somewhat human-readable/modifiable. Does anyone know of any encoding scheme that fits these needs? I am using this to store data for a database, so it needs to accept arbitrary text, and upon parsing return the exact same text.

skaffman
Bwmat

3 Answers

1

XML already supports this. You do not need to do anything special, and you certainly do not need to use CDATA. Just use a decent library, make sure you are using UTF-8 encoding, and add a text node. If something is "losing" newlines then it's a bug. XML already has an "encoding" (escaping) that is relatively human-readable. It's also standard, which makes it much more useful than inventing your own.

see, for example https://stackoverflow.com/a/1140802/181772
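As a rough sketch of what "use a library and add a text node" means in practice, here is a round trip using Python's standard `xml.etree.ElementTree`. Python stands in for the Perl, Java, C# and expat libraries mentioned in the question, and the element names `record` and `text` are made up for the example.

```python
import xml.etree.ElementTree as ET

original = 'Tom & Jerry <b>bold</b> ]]> "quoted"\nsecond line\ttabbed \u00e9\u4e2d'

root = ET.Element("record")
field = ET.SubElement(root, "text")
field.text = original  # the library escapes &, < and > when serializing

# Serialize as UTF-8 with an XML declaration, then parse it back.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
restored = ET.fromstring(xml_bytes).find("text").text

print(xml_bytes.decode("utf-8"))
print(restored == original)
```

No CDATA is involved; the markup characters are written as `&amp;`, `&lt;` and `&gt;` and turned back into the original characters by the parser.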

Community
andrew cooke
  • Just to make sure: if I took any string which could be held in some database's SQL_WCHAR column, encoded it using one of the standard Perl XML libraries such as XML::Code, and then parsed it with expat or the C#/Java standard library parsers, would I always get back the original string? – Bwmat Mar 09 '12 at 05:48
  • yes, exactly. if you don't, then there's a bug somewhere. you must set the encoding (so the document starts with an XML declaration that names it, e.g. `<?xml version="1.0" encoding="UTF-8"?>`) and follow all the rules - basically use a library to construct the document, rather than "by hand" with strings or print statements. – andrew cooke Mar 09 '12 at 10:51
  • ps similarly, on the other side, you should parse it with a library, and not use regexp etc. – andrew cooke Mar 09 '12 at 11:06
0

You could encode the content, if the content was HTML for example:

<html>&lt;b&gt;Bold Text&lt;/b&gt;</html>

vs.

<html><![CDATA[<b>Bold Text</b>]]></html>
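A quick check that the two forms above are equivalent to a parser, sketched with Python's standard `xml.etree.ElementTree` (any conforming parser should behave the same way):

```python
import xml.etree.ElementTree as ET

escaped = "<html>&lt;b&gt;Bold Text&lt;/b&gt;</html>"
cdata = "<html><![CDATA[<b>Bold Text</b>]]></html>"

# Both documents carry the same character data; only the spelling differs.
text_from_escaped = ET.fromstring(escaped).text
text_from_cdata = ET.fromstring(cdata).text

print(text_from_escaped)
print(text_from_escaped == text_from_cdata)
```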
yas
  • There should be no issue with whitespace or newlines. While rendered HTML, for example, collapses whitespace or newlines, they would be preserved in the XML. – yas Mar 07 '12 at 22:56
0

Hmm, as far as I can tell CDATA sections are for character data, and control characters don't count. I assume this means that on the matter of newlines, XML parsers make a judgement call about whether they are a control character or not (historically, yes, but practically... no.).

While it would impair readability, you can encode newlines using escape sequences (numeric character references such as `&#10;`). Assuming that you are escaping properly, the parser should convert them back; you'll just have to remember to do it when encoding.
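One case where this kind of escaping does change the result is the carriage return: XML 1.0 line-end normalization turns a literal CR LF pair into a single LF, while a character reference survives. A small sketch with Python's `xml.etree.ElementTree` (which is built on expat):

```python
import xml.etree.ElementTree as ET

# Literal CR LF in the document is normalized to LF by the parser.
literal = ET.fromstring("<t>one\r\ntwo</t>").text

# The same characters written as character references are kept as-is.
referenced = ET.fromstring("<t>one&#13;&#10;two</t>").text

print(repr(literal))
print(repr(referenced))
```

So a plain LF newline needs no escaping, but text containing CR would have to be written with `&#13;` to come back unchanged.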

Another option, one that completely violates your "human-readable" requirement, is to base-64 encode the text. This allows you to encode arbitrary information in the XML.
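A minimal sketch of the base-64 route, again in Python. The `encoding="base64"` attribute is just a made-up marker for the reader of the file, not something XML defines.

```python
import base64
import xml.etree.ElementTree as ET

# Includes a carriage return and a control character, neither of which
# would survive as plain XML 1.0 text.
original = "line one\r\nline two \x01 control char"

encoded = base64.b64encode(original.encode("utf-8")).decode("ascii")

root = ET.Element("value", {"encoding": "base64"})  # hypothetical marker attribute
root.text = encoded
xml_text = ET.tostring(root, encoding="unicode")

# Reading side: parse, then undo the base-64 and UTF-8 steps.
decoded = base64.b64decode(ET.fromstring(xml_text).text).decode("utf-8")

print(xml_text)
print(decoded == original)
```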

Aatch
  • There is no judgement call involved. Control characters other than newline, carriage return, tab and space are not allowed in XML at all (inside CDATA or outside), and white space parsing is not affected by CDATA. The only characters whose interpretation is changed by CDATA are `<` and `&`. – David Carlisle Mar 07 '12 at 23:35
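The point about control characters can be checked directly. A small sketch with Python's expat-based `xml.etree.ElementTree`; `parses` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the document is accepted by the parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


plain_ok = parses("<t>\x01</t>")             # U+0001 as ordinary text
cdata_ok = parses("<t><![CDATA[\x01]]></t>")  # U+0001 inside CDATA
tab_ok = parses("<t><![CDATA[\t\n]]></t>")    # tab and newline inside CDATA

print(plain_ok, cdata_ok, tab_ok)
```

The control character is rejected in both positions, so "arbitrary text" from a database needs either validation or an encoding step if it can contain such characters.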