
I'm trying to figure out the fastest way to count the number of child elements of a Xerces C++ DOMNode object, as I'm trying to optimise the performance of a Windows application which uses the Xerces 2.6 DOMParser.

It seems most of the time is spent counting and accessing children. Our application needs to iterate over every single node in the document to attach data to it using DOMNode::setUserData(). We were initially using DOMNode::getChildNodes(), DOMNodeList::getLength() and DOMNodeList::item(int index) to count and access children, but these are comparatively expensive operations.

We saw a large performance improvement when we switched to a different idiom: calling DOMNode::getFirstChild() to get the first child node and then DOMNode::getNextSibling() repeatedly, either to reach the child at a specific index or to count the siblings of the first child and so get the total child node count.

However, getNextSibling() remains a bottleneck in our parsing step, so I'm wondering whether there is an even faster way to traverse and access child elements using Xerces.

phuclv
ericc
  • Have you tried `DOMNodeIterator`? Don't know if it will be any better or just do one of the things you have already tried. – BoBTFish Jul 18 '12 at 14:17
  • Is it a static DOM tree? Are the same nodes visited over and over again? Maybe it is a possibility to remember the child node counts in a separate data structure, a std::map for example with a node pointer and an integer. – Clemens Jul 22 '12 at 13:47
  • Yes, soon after I posted, I added code to store and manage the child count for each node, and this has made a big difference. The same nodes were being visited repeatedly and the child count was being recalculated every time. This is quite an expensive operation, as Xerces essentially rebuilds the DOM structure for that node to guarantee its liveness. We have our own object which encapsulates a Xerces DOMNode along with the extra info that we need, and we use DOMNode::setUserData to associate our object with the relevant DOMNode. That now seems to be the last remaining bottleneck. – ericc Jul 23 '12 at 11:40

2 Answers


The problem with DOMNodeList is that it is really just a simple linked list, so operations such as getLength() and item(i) cost O(n), as can be seen in the code, for example here for getLength():

XMLSize_t DOMNodeListImpl::getLength() const{
    XMLSize_t count = 0;
    if (fNode) {
        DOMNode *node = fNode->fFirstChild;
        while(node != 0){
            ++count;
            node = castToChildImpl(node)->nextSibling;
        }
    }

    return count;
}

DOMNodeList should therefore be avoided unless the DOM tree is expected to change during iteration: each item(i) call costs O(n), which makes a full index-based iteration an O(n^2) operation, a disaster waiting to happen once the XML file is big enough.

Using DOMNode::getFirstChild() and DOMNode::getNextSibling() is a good enough solution for iteration:

DOMNode *child = docNode->getFirstChild();
while (child != nullptr) {
    // do something with the node
    ...
    child = child->getNextSibling();
}

This runs in O(n), as expected.
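To make the difference between the two traversal styles concrete, here is a minimal sketch that counts pointer hops. It deliberately does not use Xerces: `Node`, `Tree`, `makeTree`, `itemAt`, `hopsIndexed` and `hopsSibling` are hypothetical stand-ins that only model the first-child/next-sibling links discussed above, and `itemAt` assumes that an item(i) call restarts from the first child every time, as described in this answer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a DOM node: only the two links that matter here.
struct Node {
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
};

// Owns one parent node plus its children (storage keeps them alive).
struct Tree {
    std::vector<std::unique_ptr<Node>> storage;
    Node* parent = nullptr;
};

// Builds a parent with `n` children linked as a sibling chain.
Tree makeTree(std::size_t n) {
    Tree t;
    t.storage.push_back(std::make_unique<Node>());
    t.parent = t.storage.back().get();
    Node* prev = nullptr;
    for (std::size_t i = 0; i < n; ++i) {
        t.storage.push_back(std::make_unique<Node>());
        Node* child = t.storage.back().get();
        if (prev != nullptr) {
            prev->nextSibling = child;
        } else {
            t.parent->firstChild = child;
        }
        prev = child;
    }
    return t;
}

// Models a list-style item(index): walks from the first child on every call.
Node* itemAt(const Node* parent, std::size_t index, std::size_t& hops) {
    Node* n = parent->firstChild;
    while (n != nullptr && index > 0) {
        n = n->nextSibling;
        --index;
        ++hops;
    }
    return n;
}

// Index-based loop: one walk for the length, then a fresh walk per item.
std::size_t hopsIndexed(const Node* parent) {
    assert(parent != nullptr);
    std::size_t hops = 0;
    std::size_t length = 0;
    for (Node* n = parent->firstChild; n != nullptr; n = n->nextSibling) {
        ++length;
        ++hops;
    }
    for (std::size_t i = 0; i < length; ++i) {
        itemAt(parent, i, hops);
    }
    return hops;
}

// Sibling loop: a single pass over the chain.
std::size_t hopsSibling(const Node* parent) {
    assert(parent != nullptr);
    std::size_t hops = 0;
    for (Node* n = parent->firstChild; n != nullptr; n = n->nextSibling) {
        ++hops;
    }
    return hops;
}
```

In this model, calling `hopsIndexed(makeTree(1000).parent)` performs on the order of n^2/2 hops, while `hopsSibling` on the same tree performs exactly n, which is the gap the question observed in practice.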

One could also use DOMNodeIterator, but creating one requires the right DOMDocument, which is not always at hand where the iteration is needed.

ead

Yes, soon after I posted, I added code to store and manage the child count for each node, and this has made a big difference. The same nodes were being visited repeatedly and the child count was being recalculated every time. This is quite an expensive operation, as Xerces essentially rebuilds the DOM structure for that node to guarantee its liveness. We have our own object which encapsulates a Xerces DOMNode along with the extra info that we need, and we use DOMNode::setUserData to associate our object with the relevant DOMNode. That now seems to be the last remaining bottleneck.
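The child-count caching described here could look roughly like the following sketch. It is not the poster's actual code and it does not use Xerces: `Node` is a hypothetical stand-in for a DOM node, and `ChildCountCache`, `count`, `invalidate` and `walks` are illustrative names. The sketch assumes the tree is static between explicit `invalidate` calls, so a cached count stays valid until the caller says otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for a DOM node: only the two links that matter here.
struct Node {
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
};

// Remembers each parent's child count so the sibling chain is walked once.
class ChildCountCache {
public:
    // Returns the cached count, walking the sibling chain only on a miss.
    std::size_t count(const Node* parent) {
        assert(parent != nullptr);
        auto it = counts_.find(parent);
        if (it != counts_.end()) {
            return it->second;
        }
        std::size_t n = 0;
        for (const Node* c = parent->firstChild; c != nullptr; c = c->nextSibling) {
            ++n;
        }
        ++walks_;
        counts_.emplace(parent, n);
        return n;
    }

    // Must be called by the owner whenever a parent's children change.
    void invalidate(const Node* parent) { counts_.erase(parent); }

    // Number of full walks performed so far (for illustration only).
    std::size_t walks() const { return walks_; }

private:
    std::unordered_map<const Node*, std::size_t> counts_;
    std::size_t walks_ = 0;
};
```

Keying the map by node pointer follows the suggestion in the comments on the question. If a per-node wrapper object is already attached to every node, storing the count inside that wrapper instead would avoid the hash lookup entirely.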

Paul Sweatte