I'm trying to write a jQuery or pure Javascript function (preferring the more readable solution) that can count the length of a starting tag or ending tag in an HTML document.

For example,

<p>Hello.</p>

would return 3 and 4 for the starting and ending tag lengths. Adding attributes,

<span class="red">Warning!</span>

would return 18 and 7 for the starting and ending tag lengths. Finally,

<img src="foobar.png"/>

would return 23 and 0 (or -1) for the starting and ending tag lengths.

I'm looking for a canonical, guaranteed-to-work-according-to-spec solution, so I'm trying to use DOM methods rather than manual text manipulations. For example, I would like the solution to work even for weird cases like

<p>spaces infiltrating the ending tag</ p >

and

<img alt="unended singleton tags" src="foobar.png">

and such. That is, my hope is that as long as we use proper DOM methods, we should be able to find the number of characters between < and > no matter how weird things get, even

<div data-tag="<div>">HTML-like strings within attributes</div>

I have looked at the jQuery API (especially the Manipulation section, including DOM Insertion and General Attributes subsections), but I don't see anything that would help.

Currently, the best idea I have, given an element node, is

lengthOfEndTag = node.tagName.length + 3;

lengthOfStartTag = node.outerHTML.length
                 - node.innerHTML.length
                 - lengthOfEndTag;

but of course I don't want to make such an assumption for the end tag.
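
As a rough sketch, the idea above can be wrapped in a function and tried on plain objects standing in for element nodes. The name `naiveTagLengths` and the stand-in objects are hypothetical and used only for illustration; the end-tag assumption is kept on purpose to show where it breaks.

```javascript
// Sketch only: keeps the assumption that the end tag is "</" + tagName + ">".
function naiveTagLengths(node) {
    var lengthOfEndTag = node.tagName.length + 3; // "</" + name + ">"
    var lengthOfStartTag = node.outerHTML.length
                         - node.innerHTML.length
                         - lengthOfEndTag;
    return { start: lengthOfStartTag, end: lengthOfEndTag };
}

// Plain objects standing in for elements (hypothetical test data).
var span = {
    tagName: "SPAN",
    outerHTML: '<span class="red">Warning!</span>',
    innerHTML: "Warning!"
};
var img = { tagName: "IMG", outerHTML: '<img src="foobar.png">', innerHTML: "" };

console.log(naiveTagLengths(span)); // { start: 18, end: 7 }
console.log(naiveTagLengths(img));  // { start: 16, end: 6 }, wrong: <img> has no end tag
```

The second call shows why the assumption is a problem: for an element without an end tag, six characters are subtracted that were never there.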

(Finally, I'm familiar with regular expressions—but trying to avoid them if at all possible.)


EDIT

@Pointy and @squint helped me understand that it's not possible to see </ p >, for example, because the HTML is discarded once the DOM is created. That's fine. The objective, adjusted, is to find the length of the start and end tags as would be rendered in outerHTML.

Andrew Cheong
  • The browser isn't really under any obligation to record and expose the source code details of HTML tags as they appeared when parsed. – Pointy May 03 '13 at 15:07
  • @Pointy - I don't disagree; but yet the browser does expose things like `outerHTML` and `innerHTML`; so I feel there could yet be a way to do what I'm seeking... – Andrew Cheong May 03 '13 at 15:09
  • After your page loads, the original HTML markup is gone. Closest you can do is ask the browser to read the DOM and render what it observes to a new HTML string. –  May 03 '13 at 15:09
  • @squint - Hm, okay, then let's say I'm speaking not about "raw, as-is HTML," but the HTML that `outerHTML` provides. That'd be good enough. (I'll edit my question.) – Andrew Cheong May 03 '13 at 15:11
  • @BoltClock's comment disappeared, but, let's assume valid markup for now. – Andrew Cheong May 03 '13 at 15:12
  • @acheong87: Using `.outerHTML`, try `var m = foo.outerHTML; var res = m.slice(0, m.indexOf(">") + 1);` –  May 03 '13 at 15:12
  • ...oops, didn't read all the details in your question ...attribute values with `>` and so on. –  May 03 '13 at 15:14
  • Note also that not all tags will have closing tags. – Pointy May 03 '13 at 15:15
  • @Pointy - Yes, I mentioned singleton tags in my question, with examples. – Andrew Cheong May 03 '13 at 15:16
  • ...to resolve the attribute issue, you could try iterating the `.attributes` object, and building the string manually with attribute names and values *(plus tag name, spaces, brackets, etc)*. but consider that boolean attributes can take many forms. `selected` and `selected="selected"` are equivalent, so you'll never know which was originally sent. –  May 03 '13 at 15:19
  • ...another possibility is to make an XHR request for the same page. Assuming the output is static, you could take the markup and attempt some nasty string analysis, or just use a JavaScript based HTML parser. –  May 03 '13 at 15:22
  • Can you elaborate on a situation where this would be needed? – Notepad May 03 '13 at 20:15
  • I'm curious as to why you want to do this. What are you trying to do? – gen_Eric May 03 '13 at 20:33
  • @SusheelMishra - I was trying to answer this question by another user: http://stackoverflow.com/questions/16359314/how-do-i-find-the-string-index-of-a-tag-an-element-without-counting-expanded-e/16360608#16360608, who's trying to detect the starting and ending indices of a text selection by the user, _without counting expanded HTML entities_. – Andrew Cheong May 03 '13 at 21:00
  • @RocketHazmat - See above. (Can only tag one user per comment.) – Andrew Cheong May 03 '13 at 21:00
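
The suggestion in the comments to iterate `.attributes` and rebuild the start tag by hand could look roughly like the sketch below. `buildStartTag` and `escapeAttrValue` are hypothetical names. Escaping only `&` and `"` is an assumption about how a browser serializes attribute values, and a given browser may escape more than that. The stand-in object only mimics the `tagName` and `attributes` properties of a real element.

```javascript
// Assumption: attribute values are serialized with & and " escaped.
function escapeAttrValue(value) {
    return value.split("&").join("&amp;").split('"').join("&quot;");
}

// Rebuild a start tag from the tag name and the attribute list.
function buildStartTag(elm) {
    var s = "<" + elm.tagName.toLowerCase();
    for (var i = 0; i < elm.attributes.length; i++) {
        var attr = elm.attributes[i];
        s += " " + attr.name + '="' + escapeAttrValue(attr.value) + '"';
    }
    return s + ">";
}

// Plain object standing in for an element (hypothetical test data).
var fakeDiv = { tagName: "DIV", attributes: [{ name: "data-tag", value: "<div>" }] };
console.log(buildStartTag(fakeDiv));        // <div data-tag="<div>">
console.log(buildStartTag(fakeDiv).length); // 22
```

A boolean attribute comes out of this sketch as `selected=""`, which illustrates the point made in the comments: the form used in the original markup cannot be recovered.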

2 Answers

An alternative is to call XMLSerializer's serializeToString on a shallow clone of the node (with its id re-applied), which avoids having to parse innerHTML, and then split the result on "><":

var tags = (function () {
    var x = new XMLSerializer(); // scope this so it doesn't need to be remade
    return function tags(elm) {
        var s, a, id, n, o = {open: null, close: null}; // spell stuff with var
        if (elm.nodeType !== 1) throw new TypeError('Expected HTMLElement');
        n = elm.cloneNode(); // clone to get rid of innerHTML
        id = elm.getAttribute('id'); // re-apply id for clone
        if (id !== null) n.setAttribute('id', id); // if it was set
        s = x.serializeToString(n); // serialise
        a = s.split('><');
        if (a.length > 1) { // has close tag
            o.close = '<' + a.pop();
            o.open = a.join('><') + '>'; // join "just in case"
        }
        else o.open = a[0]; // no close tag
        return o;
    }
}()); // self invoke to init

After running this, you can read the .length of the open and close properties:

tags(document.body); // {open: "<body class="question-page">", close: "</body>"}

What if an attribute's value has >< in it? XMLSerializer escapes this to &gt;&lt; so it won't change the .split.
What about no close tag? close will be null.
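
The splitting step can be tried on its own, on already-serialized strings and without a DOM. In this sketch `splitTags` is a hypothetical helper that mirrors the logic inside `tags`. The input strings are hand-written stand-ins for what a serializer might return, not actual XMLSerializer output.

```javascript
// Split a serialized, childless element into its start and end tags.
function splitTags(serialized) {
    var parts = serialized.split('><');
    var result = { open: null, close: null };
    if (parts.length > 1) {             // an end tag follows the start tag
        result.close = '<' + parts.pop();
        result.open = parts.join('><') + '>';
    } else {
        result.open = parts[0];         // single tag, no end tag
    }
    return result;
}

var body = splitTags('<body class="question-page"></body>');
console.log(body.open.length, body.close.length); // 28 7

// An escaped "><" inside an attribute value does not affect the split.
var escaped = splitTags('<div data-tag="&gt;&lt;"></div>');
console.log(escaped.open); // <div data-tag="&gt;&lt;">

console.log(splitTags('<br />').close); // null
```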

Paul S.

This answer helped me understand what @Pointy and @squint were trying to say.

The following solution works for me:

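$.fn.lengthOfStartTag = function () {
    var node = this[0];
    if (!node || node.nodeType !== 1) {
        $.error("Called $.fn.lengthOfStartTag on non-element node.");
    }
    if (!$(node).is(":empty")) {
        var outer = node.outerHTML;
        // outerHTML is start tag + innerHTML + end tag, and the serialized end
        // tag is "</name>", so the last "</" marks where the end tag starts.
        // Measuring back from there avoids indexOf(innerHTML), which can match
        // inside an attribute value, e.g. <p class="a">a</p>.
        return outer.lastIndexOf("</") - node.innerHTML.length;
    }
    return node.outerHTML.length; // empty element: count the whole outerHTML
};

$.fn.lengthOfEndTag = function () {
    var node = this[0];
    if (!node || node.nodeType !== 1) {
        $.error("Called $.fn.lengthOfEndTag on non-element node.");
    }
    if (!$(node).is(":empty")) {
        // Everything from the last "</" onward is the end tag.
        return node.outerHTML.length - node.outerHTML.lastIndexOf("</");
    }
    return -1; // empty element: no end tag is reported
};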

Sample jsFiddle here.

Andrew Cheong
  • You seem to be using `is(":empty")` to determine if a tag is self-closing. This isn't a good test, as an element with no content is also empty. – James Montagne May 03 '13 at 20:49
  • Just be aware that the tags will not count extra white spaces. – Daniel Moses May 03 '13 at 20:49
  • @DMoses - Yes, I realized that that part of my requirement was too crazy, and unnecessary anyway. Thanks for the caution though. – Andrew Cheong May 03 '13 at 21:01
  • @JamesMontagne - Oh, you're right. I think the only workaround for this might be to actually use a list of singleton tags, _e.g._ `area|br|col|embed|hr|img|input|link|meta|param`. For now I'll just leave it as a caveat in my answer. Thanks for the alert. – Andrew Cheong May 03 '13 at 21:06
  • @JamesMontagne - No, wait, actually, I think it's fine as it is. In any case that an element is "empty", the length of the `outerHTML` is returned. That's all I really want, in the end. Appreciate the input regardless. – Andrew Cheong May 03 '13 at 21:08
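
One way to act on the list-of-singleton-tags idea from the comments is sketched below. `VOID_TAGS` and `tagLengths` are hypothetical names. The list is the one from the comment extended with `base`, `source`, `track` and `wbr`, on the assumption that the HTML void elements are the cases that matter. The function accepts any object exposing `tagName`, `outerHTML` and `innerHTML`, and it relies on `outerHTML` being start tag + `innerHTML` + end tag for every non-void element.

```javascript
// Assumption: these are the elements serialized without an end tag.
var VOID_TAGS = ["area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "param", "source", "track", "wbr"];

function tagLengths(node) {
    var outer = node.outerHTML;
    if (VOID_TAGS.indexOf(node.tagName.toLowerCase()) !== -1) {
        return { start: outer.length, end: -1 }; // void element: no end tag
    }
    // The serialized end tag is "</name>", so the last "</" marks its start.
    var end = outer.length - outer.lastIndexOf("</");
    return { start: outer.length - node.innerHTML.length - end, end: end };
}

// Plain objects standing in for elements (hypothetical test data).
var emptyP = { tagName: "P", outerHTML: "<p></p>", innerHTML: "" };
var br = { tagName: "BR", outerHTML: "<br>", innerHTML: "" };

console.log(tagLengths(emptyP)); // { start: 3, end: 4 }
console.log(tagLengths(br));     // { start: 4, end: -1 }
```

Unlike a test based on `:empty`, this reports a start and end tag for an element that merely has no content, and reserves -1 for elements that have no end tag at all.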