0

I'm using dom4j to parse my XML. Let's say I have something like this:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>&#402;</bar>
</foo>

When I look at the value of the "bar" node, it gives me back the special character that "&#402;" represents (ƒ) rather than the literal text.

Is there a way to prevent this and just read in the actual bit of text?

banjollity
digiarnie

3 Answers

2

If the value of the bar node were to contain a < or an & on its own, it would break the parser (a bare > is legal in most places but is usually escaped as well). To protect against this, escape all data on the way in and unescape it on the way out again.

This turns your document into:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>&amp;#402;</bar>
</foo>

It does suck, but that's XML for you.
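
As a rough illustration of that escape/unescape round trip, here is a minimal sketch in plain Java. XmlText is a hypothetical helper name and is not part of dom4j; in practice an XML writer normally escapes text for you and the parser unescapes it on read, so hand-rolled code like this is only needed when you build the XML text yourself.

```java
// Sketch only: XmlText is a hypothetical helper, not part of dom4j.
final class XmlText {
    private XmlText() {}

    /** Escapes the characters that are markup-significant in element text. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Reverses escape(); "&amp;" is handled last so "&amp;lt;" becomes "&lt;" and not "<". */
    static String unescape(String escaped) {
        return escaped.replace("&lt;", "<")
                      .replace("&gt;", ">")
                      .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = escape("&#402;");
        System.out.println(escaped);            // &amp;#402;
        System.out.println(unescape(escaped));  // &#402;
    }
}
```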

banjollity
1

The actual bit of text being &#402;? You need to escape ampersand as &amp; then.
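
To see the difference between the two spellings, here is a small sketch. It uses the JDK's built-in DOM parser instead of dom4j so that it runs without extra dependencies; any conforming XML parser should resolve the references the same way. EntityDemo and barText are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch only: uses the JDK DOM parser rather than dom4j.
final class EntityDemo {
    private EntityDemo() {}

    /** Parses the XML string and returns the text content of the first <bar> element. */
    static String barText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName("bar").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The character reference is resolved to the character U+0192.
        String resolved = barText("<foo><bar>&#402;</bar></foo>");
        // The escaped ampersand yields the literal six-character text.
        String literal = barText("<foo><bar>&amp;#402;</bar></foo>");
        System.out.println(resolved.equals("\u0192")); // true
        System.out.println(literal);                   // &#402;
    }
}
```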

ChssPly76
  • I've tried that; however, when writing to an output XML file, I still want to show just the "&" symbol and not the "&amp;" text. Of course I could go through the output file and convert "&amp;" to "&" manually in a text editor, but I was hoping not to have to do that. – digiarnie Jul 20 '09 at 01:18
  • 1
    Well, there's a difference between reading and writing. For writing you can call setEscapeText(false) on org.dom4j.io.XMLWriter to write whatever you have verbatim. If you do that, keep in mind that your reading / writing cycle will change the document so you have to be careful. – ChssPly76 Jul 20 '09 at 04:23
0

If you need to preserve numeric character references like &#nnnn; or character entity references like &something; while reading and writing the XML file, you can:

  1. Pre-process the input stream, replacing & with e.g. [$AMPERSAND_CHARACTER$]
  2. Modify the XML via dom4j
  3. Post-process the output stream, making the reverse substitution

See the example of code.
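
A minimal sketch of steps 1 and 3, working on strings rather than streams to keep it short. AmpersandGuard, protect and restore are hypothetical names, and the dom4j step is only indicated by a comment. Note two assumptions: the placeholder must never occur in the real document, and because every & is hidden, references such as &amp; and &lt; are also passed through unresolved.

```java
// Sketch only: AmpersandGuard and its placeholder constant are hypothetical names.
final class AmpersandGuard {
    static final String PLACEHOLDER = "[$AMPERSAND_CHARACTER$]";

    private AmpersandGuard() {}

    /** Step 1: hide every '&' so the parser never sees an entity or character reference. */
    static String protect(String rawXml) {
        return rawXml.replace("&", PLACEHOLDER);
    }

    /** Step 3: put the '&' characters back after the document has been written out. */
    static String restore(String writtenXml) {
        return writtenXml.replace(PLACEHOLDER, "&");
    }

    public static void main(String[] args) {
        String input = "<foo><bar>&#402;</bar></foo>";
        String safe = protect(input);
        // Step 2 would go here: parse `safe` with dom4j, modify the tree,
        // and serialize it back to a string before calling restore().
        String output = restore(safe);
        System.out.println(safe);   // <foo><bar>[$AMPERSAND_CHARACTER$]#402;</bar></foo>
        System.out.println(output); // <foo><bar>&#402;</bar></foo>
    }
}
```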

Community