I have an Android application that uses a SAX parser to extract data from an XML file. Some of the data sits inside CDATA sections and sometimes contains newline characters. Those newline characters are being removed during parsing. How do I preserve them?

By the way, I thought I found an answer here, but placing "&#xA;" inside a CDATA tag will only result in getting a literal "&#xA;" when I parse it.

Does anyone have any suggestions?

Thank you.

mahdaeng
  • You can see this topic: http://stackoverflow.com/questions/3401111/preserve-newlines-when-parsing-xml/14071260#14071260 – ghost rider3 Dec 28 '12 at 14:51

1 Answer

The parser does not remove linefeeds, whether they appear as regular character data or inside a CDATA section. In both cases, however, the various linefeed conventions (Unix \n, Windows \r\n, classic Mac \r) are normalized to the single-character canonical Unix linefeed, \n. There is no way to prevent this normalization except by using a character entity, as was suggested; and that cannot be done inside a CDATA section, because entity handling is disabled there.
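To check what a parser actually hands back, a small test along these lines may help. It is only a sketch: it runs against the default SAX parser of a desktop JDK, not Android's, and the names CdataNewlineDemo, extractText and visible are made up for illustration. It appends every characters() callback to a buffer, since SAX allows a parser to deliver one run of text in several chunks, and it prints line breaks in escaped form so that the console cannot hide them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CdataNewlineDemo {

    // Parses the XML string and returns all character data, concatenated.
    static String extractText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may split one run of text across several calls,
                // so append each chunk instead of overwriting.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    // Shows line breaks as visible escape sequences.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) throws Exception {
        // Mixed Windows (\r\n) and Unix (\n) linefeeds inside a CDATA section.
        String xml = "<msg><![CDATA[line one\r\nline two\nline three]]></msg>";
        System.out.println(visible(extractText(xml)));
    }
}
```

If the printed text still contains \n between the lines, the parser is keeping the linefeeds and the loss happens somewhere later (for example in how the text is stored or displayed).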

But why exactly do you want to prevent this normalization? If you need it for display, you can simply replace \n with whatever local linefeed you want (\r for classic Mac OS, or the \r\n sequence for Windows).
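As a rough illustration of that replacement step, assuming the parsed text only contains the normalized \n, something like the following would do. LinefeedConverter and toLocalLinefeeds are hypothetical names, not part of any library.

```java
public class LinefeedConverter {

    // Replaces every normalized "\n" with the requested line separator.
    // Assumes the input contains no "\r" already (true after XML normalization).
    static String toLocalLinefeeds(String parsedText, String separator) {
        return parsedText.replace("\n", separator);
    }

    public static void main(String[] args) {
        String parsed = "first\nsecond";

        // Explicit Windows-style linefeeds.
        String windows = toLocalLinefeeds(parsed, "\r\n");

        // Whatever the current platform uses.
        String platform = toLocalLinefeeds(parsed, System.lineSeparator());

        System.out.println(windows.length());
        System.out.println(platform);
    }
}
```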

StaxMan
  • Thank you, StaxMan. However, the "\n" characters are, indeed, removed during parsing. I know they're not supposed to be removed, but they are. That is the problem. Is there some sort of property that needs to be set to prevent this? – mahdaeng Dec 15 '10 at 20:19
  • Which SAX parser does Android use? If this really occurs, it sounds like a bug that should be reported. I am not aware of any property that removes them on any parser I have used (Xerces, Woodstox), and it would be odd to have such a setting enabled. But are you sure they are removed, or are you just printing the text out to the console? Perhaps the console is just not displaying the linefeeds that are there? Or, if you are including it on a web page, HTML collapses all white space. I am asking since I have had cases where this has been the problem. – StaxMan Dec 16 '10 at 17:55
  • Thanks, StaxMan. I'm not sure which SAX parser is used. And you may be right - it might be a bug that should be reported. I've decided to approach this problem in a different way. I will not use CDATA tags and then just replace all potentially problematic characters with their mark-up equivalent (e.g., replace "<" with "&lt;"). That will eliminate my need for the CDATA tags and allow me to use the "&#xA;" solution for the newlines. Thank you for your suggestions. – mahdaeng Dec 17 '10 at 15:08
  • Ah. Yeah, I think they just wrap a simplistic XmlPull implementation in DOM/SAX, which would explain why linefeed handling is broken. I doubt the Google team will do anything about that, based on history. Filing a bug report wouldn't hurt, but I wouldn't hold my breath for a fix; if they cared about correctness, they would have chosen a better XML parser library. – StaxMan Dec 17 '10 at 18:21
  • This is exactly the same problem I have; it's a shame Google uses this simplistic parser. – kolslorr Apr 10 '11 at 03:22
  • But why exactly do you want to prevent this normalization? – Dan Nick Jul 07 '17 at 07:23
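
For anyone taking the route described in the comments (dropping CDATA, escaping markup characters, and writing newlines as character references), a minimal sketch of the escaping step on the producing side could look like this. XmlTextEscaper and escapeForXml are hypothetical names, and the use of &#xA; for newlines is an assumption based on the character-entity suggestion in this thread.

```java
public class XmlTextEscaper {

    // Escapes text for use as XML element content without CDATA.
    // Newlines and carriage returns become character references so that
    // they do not depend on the parser's linefeed handling.
    static String escapeForXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\n': out.append("&#xA;");  break;
                case '\r': out.append("&#xD;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "if (a < b)\n  a &= b;";
        System.out.println("<msg>" + escapeForXml(raw) + "</msg>");
    }
}
```

The escaped text goes directly into the element body; a parser that handles character references should then turn each reference back into the corresponding character.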