DOM navigation: eliminating the text nodes

Question

I have a JavaScript script that reads and parses XML. It obtains the XML through an XMLHttpRequest (which calls a PHP script that returns XML). The script is supposed to receive 2 or more nodes under the first parentNode. The 2 nodes it requires have well-defined names; the others can have any name. The output from the PHP script may be:

<?xml version='1.0'?>
<things>
    <carpet>
        <id>1</id>
        <name>1</name>
        <desc>1.5</desc>
    </carpet>
    <carpet>
        <id>2</id>
        <name>2</name>
        <height>unknown</height>
    </carpet>
</things>

Here every carpet has 7 child nodes (3 elements plus 4 whitespace-only text nodes).

But it may also be:

<?xml version='1.0'?>
<things>
    <carpet>
        <id>1</id>
        <name>1</name>
        <desc>1.5</desc>
    </carpet>
    <carpet><id>2</id><name>2</name><height>unknown</height></carpet>
</things>

Here the first carpet has 7 nodes and the 2nd carpet has 3 nodes. I want my JavaScript code to treat both exactly the same way, quickly and cleanly. If possible, I'd like to remove all the whitespace text nodes between the tags, so that XML like the above would always be treated as:

<?xml version='1.0'?>
<things><carpet><id>1</id><name>1</name><desc>1.5</desc></carpet><carpet><id>2</id><name>2</name><height>unknown</height></carpet></things>

Is that possible in a quick and efficient way? I'd prefer not to use any get function (getElementsByTagName(), getElementById(), ...), if that is possible and more efficient.

T.J. Crowder · Accepted Answer · 2011-04-28T10:45:05.907

It's pretty straightforward to walk the DOM and remove the nodes you consider empty (containing only whitespace).

This has been tested and fixed, and it looks something like this (replace those magic numbers with symbols, obviously):

var reBlank = /^\s*$/;
function walk(node) {
    var child, next;
    switch (node.nodeType) {
        case 3: // Text node
            if (reBlank.test(node.nodeValue)) {
                node.parentNode.removeChild(node);
            }
            break;
        case 1: // Element node
        case 9: // Document node
            child = node.firstChild;
            while (child) {
                next = child.nextSibling;
                walk(child);
                child = next;
            }
            break;
    }
}
walk(xmlDoc); // Where xmlDoc is your XML document instance

Here, my definition of "blank" is anything that contains only whitespace according to the JavaScript interpreter's understanding of the \s (whitespace) RegExp class. Note that some implementations have issues with \s not being inclusive enough (several Unicode "blank" characters outside the ASCII range not being matched, etc.), so be sure to test with your sample data.
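As a hedged variant of the walker above, the sketch below replaces the magic numbers with named constants and spells out the whitespace class instead of relying on \s. The names ELEMENT_NODE, TEXT_NODE, DOCUMENT_NODE, reXmlBlank and stripBlankTextNodes are made up for this illustration, and the choice of "blank" as only the four XML whitespace characters is an assumption; text nodes containing any other character are kept.

```javascript
// Node type values from the DOM specification. Browsers expose the same
// values as Node.ELEMENT_NODE, Node.TEXT_NODE and Node.DOCUMENT_NODE.
var ELEMENT_NODE = 1;
var TEXT_NODE = 3;
var DOCUMENT_NODE = 9;

// Assumption: "blank" means only space, tab, carriage return and line feed
// (the XML whitespace characters). Widen this class if your data needs it.
var reXmlBlank = /^[ \t\r\n]*$/;

// Hypothetical name; same traversal as walk() in the answer above.
function stripBlankTextNodes(node) {
    var child, next;
    switch (node.nodeType) {
        case TEXT_NODE:
            if (reXmlBlank.test(node.nodeValue)) {
                node.parentNode.removeChild(node);
            }
            break;
        case ELEMENT_NODE:
        case DOCUMENT_NODE:
            child = node.firstChild;
            while (child) {
                // Remember the sibling first: removing child detaches it.
                next = child.nextSibling;
                stripBlankTextNodes(child);
                child = next;
            }
            break;
    }
}
```

It would be called the same way as the original, for example stripBlankTextNodes(xmlDoc), where xmlDoc is your XML document instance.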

I will not use your suggestion, but you gave me the idea I needed to take a step forward, which is enough to solve this. Thanks. — brunoais, Apr 28 '11 at 10:45
@brunoais Care to share your own solution/what you ended up doing? Just for interest's sake. — devios1, Sep 13 '11 at 13:38

score 0 · Answer 2 · answered Apr 28 '11 at 10:31

I would just try a very crude string replace. Assuming you store the XML text in a variable called xml:

var rex = /(\<(\/)?[A-Za-z0-9]+\>)(\s)+/gi;
var a = xml.replace( rex, "$1" );

Here's the complete test I put together:

<html><head></head>

<body>
<script type="text/javascript">
var xml = "<?xml version='1.0'?>\n" + 
"<things>\n" +
"    <carpet>\n" +
"        <id>1</id>\n" +
"        <name>1</name>\n" +
"        <desc>1.5</desc>\n" +
"    </carpet>\n" +
"    <carpet>\n" +
"        <id>2</id>\n" +
"        <name>2</name>\n" +
"        <height>unknown</height>\n" +
"    </carpet>\n" +
"</things>";

var rex = /(\<(\/)?[A-Za-z0-9]+\>)(\s)+/gi;
var a = xml.replace( rex, "$1" );
alert( a );

</script>


</body></html>

Liv

@Liv Your solution uses regex, which is a moderate-to-slow way to solve this. Also, you are assuming that I receive text from the server, which is wrong: I receive an XML object. Sorry, but your solution is no solution. – brunoais Apr 28 '11 at 10:39
Your XMLHttpRequest allows you to retrieve the response as text via its property `responseText`. Also, regex processing is not slow -- that is a misconception; you will find that it is in fact much faster than writing your own code to do search and replace within a string; even more so, it is DEFINITELY faster than a DOM traversal! But I accept that you don't want to use regexes and prefer DOM-only methods. – Liv Apr 28 '11 at 10:44
Using regular expressions to modify XML or HTML markup at the macro scale tends not to work very well. – T.J. Crowder Apr 28 '11 at 10:50
@T.J. Crowder : Pray elaborate! – Liv Apr 28 '11 at 10:50
@Liv: Just poke around here for a while, and you'll see question after question along the lines of "why doesn't my regex parse this HTML / XML correctly?" And the answer is usually because they need a proper parser. Fortunately, in this case, we have an *excellent* parser right here: The browser's parser. Why not use it? Re your specific regexp above: What if an element has attributes? What about `CDATA` sections? Processing instructions? You get the idea... If we want to work with text nodes, let's work with text nodes. – T.J. Crowder Apr 28 '11 at 11:59
I hear you -- regex is not great for parsing XML IN GENERAL. Given the XML described in the above question, though, where there are no attributes etc., the regex will work. In other words, I answered THIS question, not a generic question about parsing XML using regex. (And yes, I agree with your point that regex is NOT the right way of parsing XML!) – Liv Apr 28 '11 at 12:05
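
As a rough illustration of the limitation discussed in these comments, the sketch below runs the regular expression from the answer, unchanged, on a hypothetical input in which an element carries an attribute. The sample string and the variable names withAttribute and stripped are made up for this example; they are not from the question.

```javascript
// The regular expression from the answer above, unchanged.
var rex = /(\<(\/)?[A-Za-z0-9]+\>)(\s)+/gi;

// Hypothetical input: a carpet element with an attribute added.
var withAttribute = '<carpet id="1">\n    <name>1</name>\n</carpet>';
var stripped = withAttribute.replace(rex, "$1");

// The opening tag contains an attribute, so the pattern does not match it
// and the whitespace that follows it is left in place. The whitespace
// after </name> is still removed, because that tag does match.
console.log(JSON.stringify(stripped));
```

With attribute-free input such as the XML in the question, the same replace removes the whitespace after every tag, which is the case the answer targets.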
