
I have a question about XML parsing. I was experimenting with a sample program and modified it to try to understand how parsing works. However, I've run into output I don't quite understand, and I hope some of you can shed light on what is going on.

This is my xml file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root xmlns="http://www.test.com">
   <ApplicationSettings>
           <option_a>"10"</option_a> 
           <option_b>"24"</option_b>
   </ApplicationSettings>
</root>

I inserted debug statements throughout my program to see what happens when functions such as getChildNodes() are called. This is the output I received:

Parsing xml file...
Processing Root...
Processing children with getChildNodes()...
>>>>>>>>>>> Loop child 0: Node name is: #text
>>>>>>>>>>> Loop child 1: Node name is: ApplicationSettings
= ApplicationSettings processing children with getChildNodes()...
***** iter 0 child name is #text
***** iter 1 child name is option_a
***** iter 2 child name is #text
***** iter 3 child name is option_b
***** iter 4 child name is #text
>>>>>>>>>>> Loop: 2 Node name is: #text

From the output, I can infer that it parsed my XML file correctly. However, I noticed that the program also found extra nodes named #text (printed using the getNodeName() function). My question is: what do those #text nodes refer to, and why do they appear repeatedly throughout the loops?

Thanks!

user459811

1 Answer


Those #text nodes in your example are the whitespace between tags. For example, here:

<root xmlns="http://www.test.com">
   <ApplicationSettings>

there is a line feed followed by several spaces of indentation between ...com"> and <App...., and the DOM parser reports that whitespace as a #text child node.

You can try to parse the following to see what happens:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root xmlns="http://www.test.com"><ApplicationSettings><option_a>"10"</option_a><option_b>"24"</option_b></ApplicationSettings></root>
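The difference between the indented and the single-line file can be illustrated without any XML library. The following is only a minimal sketch: childNames is a hypothetical toy scanner, not part of Xerces-C and not a real XML parser (it assumes plain opening and closing tags with no attributes, comments, or self-closing elements). It lists the direct children found inside an element's content and reports every run of characters between tags as #text, which is roughly what a DOM parser does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy scanner, NOT a real XML parser: it only understands plain <name> and
// </name> tags (no attributes, comments, CDATA, or self-closing tags).
// Returns the node names of the direct children found in `inner`, using
// "#text" for any run of characters that sits between two tags.
std::vector<std::string> childNames(const std::string& inner) {
    std::vector<std::string> names;
    int depth = 0;       // number of child elements currently open
    std::size_t i = 0;
    while (i < inner.size()) {
        if (inner[i] == '<') {
            const std::size_t close = inner.find('>', i);
            if (close == std::string::npos) break;  // malformed input: stop
            if (inner[i + 1] == '/') {
                --depth;                            // closing tag
            } else {
                // Only tags opened directly under the parent are children.
                if (depth == 0) names.push_back(inner.substr(i + 1, close - i - 1));
                ++depth;                            // opening tag
            }
            i = close + 1;
        } else {
            std::size_t next = inner.find('<', i);
            if (next == std::string::npos) next = inner.size();
            // Character data directly under the parent becomes a text child.
            if (depth == 0) names.push_back("#text");
            i = next;
        }
    }
    return names;
}
```

Given the indented content of root (a newline and indentation, then the ApplicationSettings element, then another newline), this returns #text, ApplicationSettings, #text, which matches the three children in the question's debug output. Given the single-line content, it returns only ApplicationSettings.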
khachik
  • Interesting. I had a feeling it dealt with whitespaces somehow. Do you know if there is any way to avoid adding these whitespaces as child nodes so I can avoid extra overhead in looping? Is parsing the only solution? – user459811 Dec 28 '10 at 17:04
  • @user459811 I'm not familiar with xerces, sorry. You should refer to the documentation to find something like "ignore whitespace". – khachik Dec 28 '10 at 18:12
  • Thanks, khachik. In case anyone is interested, I resorted to an if-conditional that skips text and comment nodes so the whitespace is ignored: if ((currentNode->getNodeType() == DOMNode::TEXT_NODE) || (currentNode->getNodeType() == DOMNode::COMMENT_NODE)) { continue; } – user459811 Dec 28 '10 at 18:21
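One caveat with the conditional in the last comment is that it skips every text node, including ones that carry real content. A slightly narrower filter skips only comments and text that consists entirely of whitespace. The sketch below shows that idea on a hypothetical stand-in Node type. Node, NodeType, isWhitespaceOnly, and significantChildNames are all made up for illustration and are not Xerces-C types or functions. In a real Xerces-C program the same checks would be written against DOMNode, as in the comment above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; these are NOT Xerces-C types.
enum class NodeType { Element, Text, Comment };

struct Node {
    NodeType type;
    std::string name;            // "#text" for text nodes, tag name for elements
    std::string value;           // character data for text and comment nodes
    std::vector<Node> children;
};

// True when a text node holds nothing but whitespace (indentation, newlines).
bool isWhitespaceOnly(const Node& n) {
    return n.type == NodeType::Text &&
           std::all_of(n.value.begin(), n.value.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}

// Collects the names of the children worth processing: comments and
// whitespace-only text are skipped, but text with real content is kept.
std::vector<std::string> significantChildNames(const Node& parent) {
    std::vector<std::string> names;
    for (const Node& child : parent.children) {
        if (child.type == NodeType::Comment || isWhitespaceOnly(child)) {
            continue;
        }
        names.push_back(child.name);
    }
    return names;
}
```

Whether the broader or the narrower filter is the better choice depends on the document: if elements never mix text with child elements, keeping only element nodes is simpler and equivalent.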