2

I'm reading an XML file with dom4j. The file looks like this:

...
<Field>&#13;&#10; hello, world...</Field>
...

I read the file with SAXReader into a Document. When I call getText() on the node, I obtain the following String:

\r\n hello, world...

I do some processing and then write another file using asXML(). But the characters are not escaped as they were in the original file, which results in an error in the external system that consumes the file.

How can I escape the special characters so that &#13;&#10; appears when writing the file?

ewernli
woezelmann
  • Do you mean you get a literal newline in your string, or you get "\r\n" in your string (i.e. as characters?) – Andy Shellam Feb 12 '10 at 13:18
  • I get the newline literal. But it doesn't matter, because I would like to get the characters '&#13;&#10;' – woezelmann Feb 12 '10 at 13:23
  • Why do you want to keep them as '&#13;&#10;'? What are you trying to achieve with XML? – ewernli Feb 12 '10 at 13:36
  • I need to read an XML file, do some stuff on the values and attributes, and then write a new one... – woezelmann Feb 12 '10 at 13:48
  • @woezelmann: but `&#13;` and the actual `\r` character are equivalent in XML. They are absolutely interchangeable. – Joachim Sauer Feb 12 '10 at 13:50
  • Maybe for you, but not for the system I am passing the XML to. OK, never mind, I'm going to simply replace all '\r\n' by '&#13;&#10;' in Java – woezelmann Feb 12 '10 at 13:53
  • I don't see then where the problem is: read the XML and let DOM process the entities, do your stuff based on \r\n, then write back another XML where entities will again be transformed by DOM for you. If your external system works only with \r\n in the XML it means that it is not processing XML correctly and you may expect a ton of other issues with encoding, foreign characters, etc. – ewernli Feb 12 '10 at 14:01
  • When writing the new XML file, '\r\n' is NOT transformed to '&#13;&#10;', and the external system works only with '&#13;&#10;' – woezelmann Feb 12 '10 at 14:06
  • Then the external system is completely broken; it is not an ‘XML Parser’ by definition. – bobince Feb 12 '10 at 14:29
  • I think I finally understood your problem. I've edited your question then, but feel free to let me know if you don't agree. – ewernli Feb 12 '10 at 15:33

4 Answers

1

You cannot, at least not easily. Those aren't 'escapes'; they are character references, which are a fundamental part of XML. Xerces has some very complex support for 'unparsed entities', but I doubt that it applies to these, as opposed to the entities that are defined in a DTD.

bmargulies
1

It depends on what you're getting and what you want (see my previous comment.)

The SAX reader is doing nothing wrong: your XML is giving you a literal newline character. If you control this XML, then instead of the newline characters, you will need to insert a \ (backslash) character followed by the "r" or "n" characters (or both).

If you do not control this XML, then you will need to do a literal conversion of the newline character to "\r\n" after you've gotten your string back. In C# it would be something like:

myString = myString.Replace("\r\n", "\\r\\n");
Andy Shellam
  • My problem is that I am reading an XML file containing '&#13;&#10;', doing some conversion and then writing a new XML file. And in this new XML file I would like to have '&#13;&#10;' again. I don't want something like "\r\n" or "\\r\\n" – woezelmann Feb 12 '10 at 13:32
  • So why are you worried about escaping them then? I believe with Xerces (certainly in the C++ version) if you encode the actual literal newline character, it will come out as you had previously. If you escape them before you re-encode it, then you'll get the characters "\r\n" in your XML instead of '&#13;&#10;'. Incidentally, a double backslash in C# does come out as a single backslash in a string; it's a way of telling the compiler not to treat it as an escape sequence. – Andy Shellam Feb 12 '10 at 17:53
1

XML entities are abstracted away in DOM. Content is exposed as a String, so you don't need to bother about the encoding, which in most cases is what you want.
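To make that concrete, here is a minimal sketch that parses the question's snippet and inspects the resulting text. It uses the JDK's built-in DOM parser rather than dom4j so that it runs without extra dependencies; the class and method names (CharRefDemo, rootText) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK DOM parser, not dom4j.
class CharRefDemo {
    // Parses a small XML string and returns the text of its root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = rootText("<Field>&#13;&#10; hello, world...</Field>");
        // The references arrive as real CR and LF characters,
        // not as the literal text "&#13;&#10;".
        System.out.println(text.startsWith("\r\n")); // true
        System.out.println(text.contains("&#"));     // false
    }
}
```

Once the parser has done this, the String no longer records whether a character was written literally or as a reference in the source file.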

But SAX has some support for controlling how entities are processed. You could try to create an XMLReader with a custom EntityResolver#resolveEntity, and pass it as a parameter to the SAXReader. But I fear it may not work:

The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element)

Otherwise you could try to configure a LexicalHandler for SAX so that you are notified when an entity is encountered. The Javadoc for LexicalHandler#startEntity says:

Report the beginning of some internal and external XML entities.

You will not be able to change the resolving, but that may still help.

EDIT

You must read and write XML with the SAXReader and XMLWriter provided by dom4j. See reading an XML file and writing an XML file. Don't call asXML() and write the resulting string to a file yourself.

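import java.io.FileOutputStream;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

FileOutputStream fos = new FileOutputStream("simple.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(doc);
writer.flush();
writer.close();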
ewernli
0

You can pre-process the input stream to replace & with a placeholder, e.g. [$AMPERSAND_CHARACTER$], then do the processing with dom4j, and post-process the output stream to substitute the placeholder back.

Example (using streamflyer):

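import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import com.github.rwitzel.streamflyer.util.ModifyingReaderFactory;
import com.github.rwitzel.streamflyer.util.ModifyingWriterFactory;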

// Pre-process
Reader originalReader = new InputStreamReader(myInputStream, "utf-8");
Reader modifyingReader = new ModifyingReaderFactory().createRegexModifyingReader(originalReader, "&", "[\\$AMPERSAND_CHARACTER\\$]");

// Read and modify XML via dom4j
SAXReader xmlReader = new SAXReader();
Document xmlDocument = xmlReader.read(modifyingReader);
// ...

// Post-process
Writer originalWriter = new OutputStreamWriter(myOutputStream, "utf-8");
Writer modifyingWriter = new ModifyingWriterFactory().createRegexModifyingWriter(originalWriter, "\\[\\$AMPERSAND_CHARACTER\\$\\]", "&");

// Write to output stream
OutputFormat xmlOutputFormat = OutputFormat.createPrettyPrint();
XMLWriter xmlWriter = new XMLWriter(modifyingWriter, xmlOutputFormat);
xmlWriter.write(xmlDocument);
xmlWriter.close();

You can also use FilterInputStream/FilterOutputStream, PipedInputStream/PipedOutputStream, or ProxyInputStream/ProxyOutputStream for pre- and post-processing.
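
If adding streamflyer is not an option, the same masking idea can be sketched with the JDK alone. This is only an illustration under a few assumptions: the whole document fits in memory as a String, the placeholder never occurs in the real data, and the JDK's DOM parser and Transformer stand in for dom4j. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; JDK DOM and Transformer are used in place of dom4j.
class AmpersandRoundTrip {
    // Assumed never to appear in the real document.
    static final String PLACEHOLDER = "[$AMPERSAND_CHARACTER$]";

    static String roundTrip(String xml) throws Exception {
        // Pre-process: hide every '&' so the parser sees plain text, not references.
        String masked = xml.replace("&", PLACEHOLDER);

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(masked)));

        // ... modify the document here; text values contain the placeholder ...

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));

        // Post-process: put the ampersands back exactly where they were.
        return out.toString().replace(PLACEHOLDER, "&");
    }

    public static void main(String[] args) throws Exception {
        String in = "<Field>&#13;&#10; hello, world...</Field>";
        String out = roundTrip(in);
        System.out.println(out);
        System.out.println(out.equals(in));
    }
}
```

Note that this masks every reference, including &amp; and &lt;, so any code that inspects text values between the two steps sees the placeholder rather than the decoded character.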