xerces-c: DOM xml parsing

Question

I have a question about XML parsing. I was experimenting with a sample program and modified it to understand how parsing works. However, I've run into output I don't quite understand, and I hope someone can shed some light on what is going on.

This is my xml file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root xmlns="http://www.test.com">
   <ApplicationSettings>
           <option_a>"10"</option_a> 
           <option_b>"24"</option_b>
   </ApplicationSettings>
</root>

I inserted debug statements throughout my program to understand what happens when functions such as getChildNodes() are called. This is the output I received:

Parsing xml file...
Processing Root...
Processing children with getChildNodes()...
>>>>>>>>>>> Loop child 0: Node name is: #text
>>>>>>>>>>> Loop child 1: Node name is: ApplicationSettings
= ApplicationSettings processing children with getChildNodes()...
***** iter 0 child name is #text
***** iter 1 child name is option_a
***** iter 2 child name is #text
***** iter 3 child name is option_b
***** iter 4 child name is #text
>>>>>>>>>>> Loop: 2 Node name is: #text

From the output, I can tell that my XML file was parsed correctly. However, I noticed the program also found extra nodes named #text (printed using the getNodeName() function). My question is: what do those #text nodes refer to, and why do they appear between the other nodes in each loop?

Thanks!

score 3 · Accepted Answer · answered Dec 28 '10 at 17:00

Those #text nodes in your example are the whitespace between tags. For example, here

<root xmlns="http://www.test.com">
   <ApplicationSettings>

there is a line feed followed by several spaces of indentation between ...com"> and <App...., and the parser reports that run of characters as a #text node.

You can try parsing the following, which has no whitespace between the tags, to see the difference:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root xmlns="http://www.test.com"><ApplicationSettings><option_a>"10"</option_a><option_b>"24"</option_b></ApplicationSettings></root>

khachik


Interesting. I had a feeling it had something to do with whitespace. Do you know if there is any way to avoid adding this whitespace as child nodes so I can avoid the extra overhead when looping? Is changing the parsing the only solution? – user459811 Dec 28 '10 at 17:04
@user459811 I'm not familiar with xerces, sorry. You should refer to the documentation to find something like "ignore whitespace". – khachik Dec 28 '10 at 18:12
Thanks Khachik. I resorted to using an if-conditional to ignore the whitespace if anyone is interested: if ((currentNode->getNodeType() == DOMNode::TEXT_NODE) || (currentNode->getNodeType() == DOMNode::COMMENT_NODE)) { continue; } – user459811 Dec 28 '10 at 18:21
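
The filter in the last comment skips every text node, so it would also skip real character data such as "10" if it were applied to the children of option_a. A narrower rule is to drop only text that consists entirely of whitespace. The sketch below models that rule without any library: Node, NodeKind, isAllXmlWhitespace, significantChildren and applicationSettingsChildren are hypothetical names made up for illustration, not Xerces-C++ types or functions, and the sample data is typed in by hand from the XML in the question.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT a Xerces-C++ type.
enum class NodeKind { Element, Text, Comment };

struct Node {
    NodeKind kind;
    std::string name;   // "#text" for text nodes, the tag name for elements
    std::string value;  // character data for text nodes, empty otherwise
};

// True if s is empty or holds only XML whitespace (space, tab, CR, LF).
bool isAllXmlWhitespace(const std::string& s) {
    return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Keeps elements and text with real content; drops comments and
// whitespace-only text nodes.
std::vector<Node> significantChildren(const std::vector<Node>& children) {
    std::vector<Node> kept;
    for (const Node& child : children) {
        if (child.kind == NodeKind::Comment) {
            continue;
        }
        if (child.kind == NodeKind::Text && isAllXmlWhitespace(child.value)) {
            continue;
        }
        kept.push_back(child);
    }
    return kept;
}

// The five children of <ApplicationSettings> from the question, written
// out by hand (assumed sample data, not parser output).
std::vector<Node> applicationSettingsChildren() {
    return {
        {NodeKind::Text, "#text", "\n           "},
        {NodeKind::Element, "option_a", ""},
        {NodeKind::Text, "#text", " \n           "},
        {NodeKind::Element, "option_b", ""},
        {NodeKind::Text, "#text", "\n   "},
    };
}
```

Calling significantChildren(applicationSettingsChildren()) leaves only option_a and option_b. In Xerces-C++ itself, the same check could presumably be written with DOMNode::getNodeType() together with XMLString::isAllWhiteSpace() on the node's value. The parser also offers setIncludeIgnorableWhitespace(false), but as far as I know it only removes whitespace when a DTD or schema is available to tell the parser which whitespace is ignorable.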
