0

I am trying to use JavaScript to modify an existing HTML document so that every word of text in the web page is surrounded by a span tag whose id is a running counter. This is a very specific problem, so here is an example case:

<body><p>hello, <br>
change this</p>
<img src="lorempixel.com/200/200"> <br></body></html>

This should change to:

  <body><p><span id="1">hello,</span>
  <br> <span id="2"> change</span><span id="3"> this</span> </p>
  <br> <img src="lorempixel.com/200/200"> <br></body></html>

I am thinking of regex solutions, but they get truly complicated and I am not sure how to ignore tags and change text without completely breaking the page.

Any thoughts appreciated!

Brian HK
  • Just curious, what do you want to do with `span`s later? – dashtinejad Sep 25 '14 at 03:58
  • What about words already in a `<span>`? – slebetman Sep 25 '14 at 04:00
  • @ROX Later, in JavaScript, we would like to be able to call the innerHTML method on this specific id to change the background of the span. – Brian HK Sep 25 '14 at 04:05
  • @slebetman Hmm... would adding a second nested span break the already existing span tags? I'm quite new to HTML and javascript so I had not thought about this. What do you think? – Brian HK Sep 25 '14 at 04:06
  • Kind of like this in mind? http://stackoverflow.com/questions/7563169/detect-which-word-has-been-clicked-on-within-a-text – Jani Hyytiäinen Sep 25 '14 at 04:13
  • Maybe using a library would be easier: [Wrapping words with lettering](https://github.com/davatron5000/Lettering.js/wiki/Wrapping-words-with-lettering%28%27words%27%29) – dashtinejad Sep 25 '14 at 04:19
  • @JaniHyytiäinen Very much like that! I will study the solutions offered to see if I can find anything applicable. To expand on the purpose, I am looking to build an app that highlights whatever the user has in its selection. Then I need to be able to un-highlight and to also store whatever is being highlighted and store it in a database. – Brian HK Sep 25 '14 at 04:27
  • @ROX That looks crazy useful! Can I apply it to a tag that is not the immediate parent to a textnode? – Brian HK Sep 25 '14 at 04:28
  • @Blessoul Just try it. Maybe it is exactly what you want ;) – dashtinejad Sep 25 '14 at 04:29
  • @RobG What do you mean? I wouldn't change the background using its innerhtml, I'd use the ID I'm attaching to the span right? – Brian HK Sep 25 '14 at 04:31

2 Answers

8

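Don't use regex on raw HTML. Use it only on text. This is because a regular expression can only describe a regular language, whereas HTML is recursively nested. You need a real parser (a recursive descent parser, for example) to handle HTML properly. In the browser the page has already been parsed into the DOM, so work on that instead.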
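To see what goes wrong, here is a minimal sketch of a naive regex applied to a markup string. The function name `naiveWrap` and the sample markup are made up for illustration; the snippet only manipulates strings, so it runs in a browser console or in Node.

```javascript
// Hypothetical naive approach: treat the markup as one big string and
// wrap every run of non-whitespace characters in a span.
function naiveWrap(html) {
  var id = 1;
  return html.replace(/\S+/g, function (word) {
    return '<span id="' + id++ + '">' + word + '</span>';
  });
}

var input = '<p>hello <img src="a.png" alt="a cat"></p>';
var output = naiveWrap(input);

// The regex cannot tell tags and attributes from text, so the <img> tag
// itself ends up cut into pieces by span tags.
console.log(output);
```

The pattern has no notion of where a tag starts or ends, which is why the rest of this answer works on text nodes only.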

First a few useful features of the DOM:

  1. document.body is the root of the page's visible content (the root of the whole DOM tree is document itself)
  2. Every node of the DOM has a childNodes collection, a live array-like NodeList (even comments, text, and attributes, where it is simply empty)
  3. Element nodes such as <span> or <h1> don't contain text directly, instead they contain text nodes that contain the text.
  4. All nodes have a nodeType property; text nodes are type 3 and element nodes are type 1.
  5. All nodes have a nodeValue property that holds different things depending on what kind of node it is. For text nodes nodeValue contains the actual text.
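As a rough illustration of these properties, the sketch below describes a node using only `nodeType`, `nodeName`, `nodeValue` and `childNodes`. The helper `describeNode` is hypothetical, and plain objects stand in for real DOM nodes so that the snippet also runs outside a browser.

```javascript
// Numeric node types defined by the DOM (Node.ELEMENT_NODE, Node.TEXT_NODE,
// Node.COMMENT_NODE); repeated here so the sketch needs no browser globals.
var ELEMENT_NODE = 1, TEXT_NODE = 3, COMMENT_NODE = 8;

// Hypothetical helper: summarise a node from its standard properties.
function describeNode(node) {
  if (node.nodeType === TEXT_NODE) {
    return 'text: ' + JSON.stringify(node.nodeValue);
  }
  if (node.nodeType === ELEMENT_NODE) {
    return 'element: <' + node.nodeName.toLowerCase() + '> with ' +
           node.childNodes.length + ' child node(s)';
  }
  if (node.nodeType === COMMENT_NODE) {
    return 'comment: ' + JSON.stringify(node.nodeValue);
  }
  return 'other node type ' + node.nodeType;
}

// Stand-ins for the nodes a browser would build from <p>hello, </p>.
var textNode = { nodeType: 3, nodeName: '#text', nodeValue: 'hello, ', childNodes: [] };
var paragraph = { nodeType: 1, nodeName: 'P', nodeValue: null, childNodes: [textNode] };

console.log(describeNode(paragraph));
console.log(describeNode(textNode));
```

In a browser the same helper can be called on real nodes, for example `describeNode(document.body.firstChild)`.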

So, using the information above we can surround all words with a span.

First a simple utility function that allows us to process the DOM:

// First a simple implementation of recursive descent,
// visit all nodes in the DOM and process it with a callback:
function walkDOM (node,callback) {
    if (node.nodeName != 'SCRIPT') { // ignore javascript
        callback(node);
        for (var i=0; i<node.childNodes.length; i++) {
            walkDOM(node.childNodes[i],callback);
        }
    }
}
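To check the visiting order without a browser, the walker can be run over a hand-built tree. This is only a sketch: the plain objects below imitate nodes by providing `nodeName` and `childNodes`, and `walkDOM` is repeated so the snippet is self-contained.

```javascript
// Same walker as above, repeated so this snippet stands alone.
function walkDOM(node, callback) {
  if (node.nodeName != 'SCRIPT') { // ignore javascript
    callback(node);
    for (var i = 0; i < node.childNodes.length; i++) {
      walkDOM(node.childNodes[i], callback);
    }
  }
}

// Hand-built stand-in for <body><p>text<br></p><script>code</script></body>.
var tree = {
  nodeName: 'BODY',
  childNodes: [
    { nodeName: 'P', childNodes: [
      { nodeName: '#text', childNodes: [] },
      { nodeName: 'BR', childNodes: [] }
    ] },
    { nodeName: 'SCRIPT', childNodes: [
      { nodeName: '#text', childNodes: [] }
    ] }
  ]
};

// Record the names in the order they are visited (parents before children).
var visited = [];
walkDOM(tree, function (n) { visited.push(n.nodeName); });
console.log(visited.join(' '));
```

The script element and the text inside it never reach the callback, which is what keeps inline JavaScript untouched.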

Now we can walk the DOM and find text nodes:

var textNodes = [];
walkDOM(document.body,function(n){
    if (n.nodeType == 3) {
        textNodes.push(n);
    }
});

Note that I'm doing this in two steps to avoid wrapping words twice.
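The reason for the two steps can be sketched with plain objects standing in for DOM nodes (an assumption so that the snippet runs anywhere). The helpers `text`, `element`, `wrapWhileWalking` and `collectThenWrap` are hypothetical names, not part of the DOM.

```javascript
// Minimal stand-ins for text and element nodes.
function text(value) { return { type: 'text', value: value, children: [] }; }
function element(name, children) { return { type: 'element', name: name, children: children }; }

// One pass: wrap each text node as soon as it is visited. The new span is
// then visited too, so its text gets wrapped again; maxWraps is only a
// safety cap that stops the runaway nesting.
function wrapWhileWalking(node, state) {
  for (var i = 0; i < node.children.length; i++) {
    var child = node.children[i];
    if (child.type === 'text' && state.wraps < state.maxWraps) {
      state.wraps++;
      node.children[i] = element('span', [child]);
    }
    wrapWhileWalking(node.children[i], state);
  }
}

// Two passes: collect the text nodes first, then wrap only what was collected.
function collectThenWrap(root) {
  var found = [];
  (function collect(node) {
    node.children.forEach(function (child, i) {
      if (child.type === 'text') found.push({ parent: node, index: i });
      else collect(child);
    });
  })(root);
  found.forEach(function (f) {
    f.parent.children[f.index] = element('span', [f.parent.children[f.index]]);
  });
  return found.length;
}

var state = { wraps: 0, maxWraps: 5 };
wrapWhileWalking(element('p', [text('hello')]), state);
var wrappedOnce = collectThenWrap(element('p', [text('hello')]));

// The one-pass version keeps wrapping until it hits the cap; the two-pass
// version wraps the single text node exactly once.
console.log(state.wraps, wrappedOnce);
```

With real DOM nodes the details differ, but the effect is the same: nodes created during the walk become part of the tree being walked.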

Now we can process the text nodes:

// simple utility functions to avoid a lot of typing:
function insertBefore (new_element, element) {
    element.parentNode.insertBefore(new_element,element);
}
function removeElement (element) {
    element.parentNode.removeChild(element);
}
function makeSpan (txt, attrs) {
    var s = document.createElement('span');
    for (var i in attrs) {
        if (attrs.hasOwnProperty(i)) s[i] = attrs[i];
    }
    s.appendChild(makeText(txt));
    return s;
}
function makeText (txt) {return document.createTextNode(txt)}

var id_count = 1;
for (var i=0; i<textNodes.length; i++) {
    var n = textNodes[i];
    var txt = n.nodeValue;
    var words = txt.split(' ');

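    // Insert span-wrapped words. Note: leading, trailing or repeated spaces
    // make split(' ') return empty strings, which become empty spans here.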
    insertBefore(makeSpan(words[0],{id:id_count++}),n);
    for (var j=1; j<words.length; j++) {
        insertBefore(makeText(' '),n); // join the words with spaces
        insertBefore(makeSpan(words[j],{id:id_count++}),n);
    }
    // Now remove the original text node:
    removeElement(n);
}

There you have it. It's cumbersome, but it never rewrites markup as a string, so it won't mangle other tags, attributes or JavaScript in your page. A lot of the utility functions I have above can be replaced with the library of your choice. But don't take the shortcut of treating the entire document as a giant innerHTML string. Not unless you're willing to write an HTML parser in pure JavaScript.
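As an alternative to the hand-written walker, browsers also provide `document.createTreeWalker`. The following is a minimal sketch that assumes a browser environment and an HTML document (where `nodeName` is upper case); the helper name `collectTextNodes` and its skip list are hypothetical.

```javascript
// Hypothetical helper: collect the text nodes under `root` with the
// built-in TreeWalker. Assumes a browser (uses document and NodeFilter).
function collectTextNodes(root) {
  // Parent elements whose text should be left alone (extend as needed).
  var skip = { SCRIPT: true, STYLE: true, TEXTAREA: true };

  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode: function (node) {
      var parentName = node.parentNode ? node.parentNode.nodeName : '';
      return Object.prototype.hasOwnProperty.call(skip, parentName)
        ? NodeFilter.FILTER_REJECT
        : NodeFilter.FILTER_ACCEPT;
    }
  });

  // Drain the walker into a plain array before modifying the document.
  var nodes = [];
  var current;
  while ((current = walker.nextNode())) {
    nodes.push(current);
  }
  return nodes;
}

// Usage in a browser:
//   var textNodes = collectTextNodes(document.body);
```

The result is an ordinary array, so it can take the place of the `textNodes` array built with `walkDOM` above.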

slebetman
  • Warning: the code here is untested but the theory is correct. Please test it for bugs before introducing it to production code. – slebetman Sep 25 '14 at 05:02
  • Note: since the walkDOM is recursive you can start at an element/node of your choice instead of document.body. – slebetman Sep 25 '14 at 05:07
  • I think you need to convert *childNodes* to an array (or static object of some kind) before looping over it. It's a live [*NodeList*](http://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-536297177), so as you add nodes it will change in length in a way that is not coordinated with your adding nodes and incrementing *i*. It also needs to skip more than just script elements (e.g. textarea, option and button elements have content that is not markup). – RobG Sep 25 '14 at 05:56
  • Instead of walking the DOM yourself, use TreeWalker. –  Sep 25 '14 at 05:59
  • @RobG: That's why I'm doing it in 2 phases. When I add nodes I am no longer iterating over childNodes :) – slebetman Sep 25 '14 at 06:17
  • @RobG: Also, skipping other kinds of content is left as exercise to the user. This is just demonstrating the mechanics behind it. As torazaburo mentioned, you can also substitute the walker with other implementations. Which is why this comes with a warning (the warning indirectly says: I'm showing you the way, not doing your homework for you) – slebetman Sep 25 '14 at 06:20
  • It skips the first word in an element if there is no leading whitespace, e.g. in an element whose only content is `foo bar`, `foo` is not wrapped. – RobG Sep 25 '14 at 06:25
  • @RobG: Are you sure? I'm sure it doesn't. See the line before the for loop – slebetman Sep 25 '14 at 06:30
  • Ah, you are adding wrapped nodes in two places! It treats entities like `&nbsp;` as word characters so `split(' ')` would be better as `split(/\s+/)`. – RobG Sep 25 '14 at 06:38
  • @RobG: The problem with `split(/\s+/)` is that we cannot know how many spaces to join by for each word. This breaks `<pre>` and `<code>` blocks (or any block with fixed width text). Anyway, all this is not important since these are details (bugs) left for the reader to implement (fix). For example, I have not handled what happens when there are two spaces in the text which would result in an empty string which would generate an empty span. That is left as an exercise for the reader. As I said, I'm only showing the way. Not doing other people's homework. – slebetman Sep 25 '14 at 07:53
  • And all these bug reports and justifications for not fixing them have actually taken me more time than answering the question to begin with. – slebetman Sep 25 '14 at 07:53
  • This solution seems straight-forward enough! Thank you for explaining it clearly, I'll play around with this. – Brian HK Sep 25 '14 at 17:57
  • @slebetman: I've shown how to deal with multiple whitespace characters in my answer, and how to skip certain elements. A valuable lesson when estimating is that the final code will often take 3 to 4 times longer than the initial draft as you deal with all the foibles that weren't evident at the start. ;-) – RobG Sep 25 '14 at 21:15
1

This sort of processing is always a lot more complex than you think. The following will wrap sequences of characters that match \S+ (sequence of non–whitespace) and not wrap sequences that match \s+ (whitespace).

It also allows the content of certain elements to be skipped, such as script, input, button, select and so on. Note that the live collection returned by childNodes must be converted to a static array, otherwise it is affected by the new nodes being added. An alternative is element.querySelectorAll(), which returns a static list, but it only returns elements (not text nodes) and childNodes has wider support.

// Copy numeric properties of Obj from 0 to length
// to an array
function toArray(obj) {
  var arr = [];
  for (var i=0, iLen=obj.length; i<iLen; i++) {
    arr.push(obj[i]);
  }
  return arr;
}


// Wrap the words of an element and child elements in a span
// Recurs over child elements, add an ID and class to the wrapping span
// Does not affect elements with no content, or those to be excluded
var wrapContent = (function() {
  var count = 0;

  return function(el) {

    // If element provided, start there, otherwise use the body
    el = el && el.parentNode? el : document.body;

    // Get all child nodes as a static array
    var node, nodes = toArray(el.childNodes);
    var frag, parent, text;
    var re = /\S+/;
    var sp, span = document.createElement('span');

    // Tag names of elements to skip, there are more to add
    var skip = {'script':'', 'button':'', 'input':'', 'select':'',
                'textarea':'', 'option':''};

    // For each child node...
    for (var i=0, iLen=nodes.length; i<iLen; i++) {
      node = nodes[i];

      // If it's an element, call wrapContent
      if (node.nodeType == 1 && !(node.tagName.toLowerCase() in skip)) {
        wrapContent(node);

      // If it's a text node, wrap words
      } else if (node.nodeType == 3) {

        // Match sequences of whitespace and non-whitespace
        text = node.data.match(/\s+|\S+/g);

        if (text) {

          // Create a fragment, handy suckers these
          frag = document.createDocumentFragment();

          for (var j=0, jLen=text.length; j<jLen; j++) {

            // If not whitespace, wrap it and append to the fragment
            if (re.test(text[j])) {
              sp = span.cloneNode(false);
              sp.id = count++;
              sp.className = 'foo';
              sp.appendChild(document.createTextNode(text[j]));
              frag.appendChild(sp);

            // Otherwise, just append it to the fragment
            } else {
              frag.appendChild(document.createTextNode(text[j]));
            }
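          }

          // Replace the original node with the fragment. This stays inside
          // the if so that an empty text node is never replaced with a
          // stale or undefined fragment.
          node.parentNode.replaceChild(frag, node);
        }
      }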
    }
  }
}());

window.onload = wrapContent;

The above addresses only the most common cases; it will need more work and thorough testing.
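The whitespace handling can be checked on its own, away from the DOM. A small sketch using the same pattern as `wrapContent`; the sample string and variable names are only for illustration.

```javascript
// A string with leading, repeated and newline whitespace.
var sample = '  foo  bar\nbaz';

// Split into alternating runs of whitespace and non-whitespace.
// Note: match() returns null for an empty string, hence the if (text)
// guard in wrapContent.
var tokens = sample.match(/\s+|\S+/g);

// Only the non-whitespace tokens would be wrapped in spans.
var words = tokens.filter(function (token) { return /\S/.test(token); });

// Nothing is lost: joining the tokens gives back the original string,
// so fixed-width text keeps its exact spacing.
var roundTrip = tokens.join('');

console.log(tokens, words, roundTrip === sample);
```

Because every whitespace run is kept as its own token, the number of spaces between words is preserved, unlike with `split(/\s+/)`.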

RobG