I read elements with CDATA sections from an RSS feed, which I need to convert to valid XML. The content in the CDATA section is mostly valid XHTML, but sometimes characters like ampersands appear unescaped in attributes (URLs).

I can use .replaceAll("&", "&amp;") to solve this, but looking ahead, other invalid characters may show up in attributes or text.

The CMS I'm importing the element into won't accept CDATA sections unless I set up a separate configuration for the content, so my question is: is there a simple way to escape the string, but only within attribute values and text?

I'm using the JDOM library to manipulate the XML after the import.

Edit: I've checked out Apache's StringEscapeUtils, but it escapes the whole string. I need something that will only escape attribute values and text inside elements.

Karine
  • `.replaceAll("&", "&amp;")` will mess up any existing HTML entities. E.g. `&lt;` would become `&amp;lt;`. – Duncan Jones Sep 05 '12 at 10:41
  • see this link http://stackoverflow.com/questions/599634/convert-html-character-back-to-text-using-java-standard-library – yael alfasi Sep 05 '12 at 10:43
  • That's true as well, Duncan. The StringEscapeUtils will escape the whole string, and is not exactly what I'm looking for. – Karine Sep 05 '12 at 10:53
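The double-escaping problem raised in the comments can be avoided by escaping only those ampersands that do not already begin a character or entity reference. The following is a minimal sketch, not a complete solution: the class name `AmpersandFixer` and the method `escapeBareAmpersands` are hypothetical, and it handles ampersands only, not other invalid characters such as a stray `<` in text. It also assumes that anything shaped like `&name;` is an intentional entity, which can misfire on unusual URLs, and it leaves HTML-only entities such as `&nbsp;` untouched even though a plain XML parser will not know them.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes "bare" ampersands in markup before it is parsed as XML.
final class AmpersandFixer {

    // Matches '&' unless it starts a named (&name;), decimal (&#38;) or hex (&#x26;) reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    static String escapeBareAmpersands(String markup) {
        // '&' has no special meaning in a Java replacement string (only '$' and '\' do).
        return BARE_AMP.matcher(markup).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String in = "<a href=\"page?x=1&y=2\">R&amp;D &lt; &#169;</a>";
        // prints: <a href="page?x=1&amp;y=2">R&amp;D &lt; &#169;</a>
        System.out.println(escapeBareAmpersands(in));
    }
}
```

Because the pattern works on the raw string, it does not need to know where attributes or text begin, and the tag delimiters themselves are left alone.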

2 Answers

Apache Commons provides handy functions for this: StringEscapeUtils

R. Oosterholt
Doug
  • I've tried this. Unfortunately, it escapes the whole string, including the <> surrounding the elements. I'm looking for something that only escapes attribute values and text. – Karine Sep 05 '12 at 10:44

When you use JDOM, it automatically escapes any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS...?

In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output.... so, what are you doing to have broken output?

Rolf

rolfl
  • The CMS is loaded with the output of JDOM. The problem is that I don't always have valid input for the SAXBuilder, as some attribute values may include unescaped ampersands. – Karine Sep 05 '12 at 11:12
  • ...in which case, JDOM will have decoded those escaped characters (or your SAX Parser will), and what you see in JDOM will be unescaped, and will be re-escaped when output. – rolfl Sep 05 '12 at 16:50
  • Not sure if I follow you. The JDOM SAXBuilder will not accept an invalid XML string (one with unescaped ampersands). – Karine Sep 27 '12 at 15:12
  • Exactly, the XML Parser will see ... &amp; in the input, and then decode it. JDOM will see just & and then, if you output using JDOM, JDOM will re-encode the & as &amp; in the output. Now, let's see if those characters survive StackExchange's comment system – rolfl Sep 27 '12 at 19:14
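
The round trip described in these comments can be sketched as follows. To keep the example free of dependencies, it uses the JDK's built-in DOM parser and serializer rather than JDOM, so it illustrates the same decode-then-re-encode behaviour and is not JDOM code. The class name `RoundTripDemo` and its helper methods are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: parse well-formed XML, inspect it, write it back out.
final class RoundTripDemo {

    // Parsing decodes references such as &amp; into plain characters.
    // It throws (a SAXParseException) if the input contains a bare '&'.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serializing re-escapes characters that need it in attributes and text.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a href=\"page?x=1&amp;y=2\">R&amp;D</a>");
        // In memory the values are unescaped: page?x=1&y=2 and R&D
        System.out.println(doc.getDocumentElement().getAttribute("href"));
        System.out.println(doc.getDocumentElement().getTextContent());
        // On output the ampersands are written as &amp; again
        System.out.println(serialize(doc));
    }
}
```

This also shows why the input has to be well-formed before it reaches the parser: a bare ampersand is rejected at parse time, so any repair has to happen on the raw string first.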