How to replace words with span tag using jsoup?

Question

Assume I have the following html:

<html>
<head>
</head>
<body>
    <div id="wrapper" >
         <div class="s2">I am going <a title="some title" href="">by flying</a>
           <p>mr tt</p>
         </div> 
    </div>
</body>    
</html>

Any word in the text nodes that is 4 or more characters long, for example the word 'going', should be replaced with HTML content (not text), <span>going</span>, in the original HTML without changing anything else.

If I try something like element.html(replacement), the problem is that if, say, the current element is <div class="s2">, it will also wipe out the nested <a title="some title" href=""> element.

score 12 · Accepted Answer · edited May 23 '17 at 11:44

In this case you must traverse your document as suggested by this answer. Here's a way of doing it using Jsoup APIs:

NodeTraversor and NodeVisitor allow you to traverse the DOM
Node.replaceWith(...) allows for replacing a node in the DOM

Here's the code:

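import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class JsoupReplacer {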

  public static void main(String[] args) {
    so6527876();
  }

  public static void so6527876() {
    String html = 
    "<html>" +
    "<head>" +
    "</head>" +
    "<body>" +
    "    <div id=\"wrapper\" >" +
    "         <div class=\"s2\">I am going <a title=\"some title\" href=\"\">by flying</a>" +
    "           <p>mr tt</p>" +
    "         </div> " +
    "    </div>" +
    "</body>    " +
    "</html>";
    Document doc = Jsoup.parse(html);

    final List<TextNode> nodesToChange = new ArrayList<TextNode>();

    NodeTraversor nd  = new NodeTraversor(new NodeVisitor() {

      @Override
      public void tail(Node node, int depth) {
        if (node instanceof TextNode) {
          TextNode textNode = (TextNode) node;
          String text = textNode.getWholeText();
          String[] words = text.trim().split(" ");
          for (String word : words) {
            // The question asks for words of 4 or more characters.
            if (word.length() >= 4) {
              nodesToChange.add(textNode);
              break;
            }
          }
        }
      }

      @Override
      public void head(Node node, int depth) {
      }
    });

    nd.traverse(doc.body());

    for (TextNode textNode : nodesToChange) {
      Node newNode = buildElementForText(textNode);
      textNode.replaceWith(newNode);
    }

    System.out.println("result: ");
    System.out.println();
    System.out.println(doc);
  }

  private static Node buildElementForText(TextNode textNode) {
    String text = textNode.getWholeText();
    String[] words = text.trim().split(" ");
    Set<String> longWords = new HashSet<String>();
    for (String word : words) {
      // Same threshold as in the visitor above: 4 or more characters.
      if (word.length() >= 4) {
        longWords.add(word);
      } 
    }
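    String newText = text;
    for (String longWord : longWords) {
      // replaceAll takes a regex, so quote both the word and the replacement;
      // otherwise characters such as '(' or '$' in the text would be
      // interpreted as regex syntax.
      newText = newText.replaceAll(
          java.util.regex.Pattern.quote(longWord),
          java.util.regex.Matcher.quoteReplacement("<span>" + longWord + "</span>"));
    }
    return new DataNode(newText, textNode.baseUri());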
  }

}
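As a side note on the word-wrapping step in buildElementForText: the Set plus replaceAll approach also matches a long word where it occurs inside another word, and a word such as "span" could match the markup inserted by an earlier replacement. A minimal sketch of a single-pass alternative follows, using only the standard library. The class name WordSpanner, the method wrapLongWords and the minLength parameter are hypothetical names made up for this illustration, and a "word" is assumed to be any run of non-whitespace characters. It does not escape HTML special characters in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordSpanner {

    // Assumption: a "word" is any run of non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("\\S+");

    /** Wraps every word of at least minLength characters in <span>...</span>. */
    static String wrapLongWords(String text, int minLength) {
        Matcher m = WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the whitespace between words unchanged.
            out.append(text, last, m.start());
            String word = m.group();
            if (word.length() >= minLength) {
                out.append("<span>").append(word).append("</span>");
            } else {
                out.append(word);
            }
            last = m.end();
        }
        // Copy any trailing whitespace.
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapLongWords("I am going ", 4));
    }
}
```

Because each word is visited exactly once and the output is built separately from the input, already inserted markup is never scanned again, and the original spacing around the words is kept.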

Thanks a lot to the authors of both answers. @MarcoS I used your solution and it's very flexible. I wonder why you used Set instead of List? — rjc, Jul 06 '11 at 19:33
@rjc: well, I used `Set<String> longWords` in `buildElementForText` because I only need to remember the distinct words and then use `replaceAll`. I noticed that I had also used `final Set<TextNode> nodesToChange`: that was actually a mistake, and I changed it into a `List` — MarcoS, Jul 07 '11 at 07:31
Thanks for the reply. I figured out that a Set must be used for the long-words collection. I wanted to modify nodes within the head method rather than collecting the nodes to be modified and modifying them after traversal. Is this OK, or not even allowed? — rjc, Jul 07 '11 at 09:12
@rjc: you cannot modify nodes within the `head` (or `tail`) method, because you're traversing the tree, and modifying nodes while traversing is not a good idea :) This is why I collect the nodes to be modified while traversing the tree, and then replace them afterwards. — MarcoS, Jul 07 '11 at 09:32
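
The collect-then-replace pattern described in the last comment can be sketched without jsoup. This is only an illustration under stated assumptions: MiniNode is a hypothetical stand-in for a DOM node, not a jsoup class, and wrapping the text in square brackets stands in for building the real replacement node.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPhaseEdit {

    /** Hypothetical stand-in for a DOM node: a text leaf or a container. */
    static final class MiniNode {
        final String text;                        // null for containers
        final List<MiniNode> children = new ArrayList<>();
        MiniNode parent;

        MiniNode(String text) { this.text = text; }

        MiniNode add(MiniNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    static boolean hasLongWord(String text) {
        for (String word : text.trim().split("\\s+")) {
            if (word.length() >= 4) return true;
        }
        return false;
    }

    /** Phase 1: walk the tree read-only and remember the leaves to change. */
    static void collect(MiniNode node, List<MiniNode> hits) {
        if (node.text != null && hasLongWord(node.text)) {
            hits.add(node);
        }
        for (MiniNode child : node.children) {
            collect(child, hits);
        }
    }

    /** Phase 2: the traversal is finished, so swapping children is safe. */
    static void replaceCollected(List<MiniNode> hits) {
        for (MiniNode old : hits) {
            if (old.parent == null) continue;     // a root leaf has nowhere to be swapped
            MiniNode fresh = new MiniNode("[" + old.text + "]");
            fresh.parent = old.parent;
            int index = old.parent.children.indexOf(old);
            old.parent.children.set(index, fresh);
        }
    }

    public static void main(String[] args) {
        MiniNode root = new MiniNode(null)
                .add(new MiniNode("I am going "))
                .add(new MiniNode("mr tt"));
        List<MiniNode> hits = new ArrayList<>();
        collect(root, hits);
        replaceCollected(hits);
        for (MiniNode child : root.children) {
            System.out.println(child.text);
        }
    }
}
```

The point of the split is that no child list is modified while a loop is still iterating over it; all structural changes happen after the walk has ended.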

Mark McLaren · Answer 2 · 2011-09-01T07:48:18.910

I think you need to traverse the tree. The result of text() on an Element will be all of the Element's text including text within child elements. Hopefully something like the following code will be helpful to you:

import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ScreenScrape {

    public static void main(String[] args) throws IOException {
        String content = FileUtils.readFileToString(new File("test.html"));
        Document doc = Jsoup.parse(content);
        Element body = doc.body();
        //System.out.println(body.toString());

        StringBuilder sb = new StringBuilder();
        traverse(body, sb);

        System.out.println(sb.toString());
    }

    private static void traverse(Node n, StringBuilder sb) {
        if (n instanceof Element) {
            sb.append('<');
            sb.append(n.nodeName());            
            if (n.attributes().size() > 0) {
                sb.append(n.attributes().toString());
            }
            sb.append('>');
        }
        if (n instanceof TextNode) {
            TextNode tn = (TextNode) n;
            if (!tn.isBlank()) {
                sb.append(spanifyText(tn.text()));
            }
        }
        for (Node c : n.childNodes()) {
            traverse(c, sb);
        }
        if (n instanceof Element) {
            sb.append("</");
            sb.append(n.nodeName());
            sb.append('>');
        }        
    }

    private static String spanifyText(String text){
        StringBuilder sb = new StringBuilder();
        StringTokenizer st = new StringTokenizer(text);
        String token;
        while (st.hasMoreTokens()) {
             token = st.nextToken();
             if(token.length() > 3){
                 sb.append("<span>");
                 sb.append(token);
                 sb.append("</span>");
             } else {
                 sb.append(token);
             }             
             sb.append(' ');
        }
        return sb.substring(0, sb.length() - 1).toString();
    }

}

UPDATE

Using Jonathan's new jsoup List<TextNode> element.textNodes() method and combining it with MarcoS's suggested NodeTraversor/NodeVisitor technique, I came up with the following (although I am modifying the tree whilst traversing it, which is probably a bad idea):

Document doc = Jsoup.parse(content);
Element body = doc.body();
NodeTraversor nd = new NodeTraversor(new NodeVisitor() {

    @Override
    public void tail(Node node, int depth) {
        if (node instanceof Element) {
            boolean foundLongWord;
            Element elem = (Element) node;
            Element span;
            String token;
            StringTokenizer st;
            ArrayList<Node> changedNodes;
            Node currentNode;
            for (TextNode tn : elem.textNodes()) {
                foundLongWord = false;
                changedNodes = new ArrayList<Node>();
                st = new StringTokenizer(tn.text());
                while (st.hasMoreTokens()) {
                    token = st.nextToken();
                    if (token.length() > 3) {
                        foundLongWord = true;
                        span = new Element(Tag.valueOf("span"), elem.baseUri());
                        span.appendText(token);
                        changedNodes.add(span);
                        // Re-add the separator that StringTokenizer dropped, so the
                        // span does not run into the following word or element.
                        changedNodes.add(new TextNode(" ", elem.baseUri()));
                    } else {
                        changedNodes.add(new TextNode(token + " ", elem.baseUri()));
                    }
                }
                if (foundLongWord) {
                    currentNode = changedNodes.remove(0);
                    tn.replaceWith(currentNode);
                    for (Node n : changedNodes) {
                        currentNode.after(n);
                        currentNode = n;
                    }
                }
            }
        }
    }

    @Override
    public void head(Node node, int depth) {
    }
});    
nd.traverse(body);
System.out.println(body.toString());

I suppose ownText() could possibly have been used if it returned a String array of direct child text nodes rather than a String. — Mark McLaren, Jul 06 '11 at 09:23
I've added a List<TextNode> element.textNodes() method to jsoup which will be available in 1.6.2. https://github.com/jhy/jsoup/commit/7b9f17760049161b775fd23b15653961620e259d — Jonathan Hedley, Aug 30 '11 at 02:21
Here we have `for (TextNode tn : elem.textNodes())` and, as @JonathanHedley says, it is available in 1.6.2, which is not released yet. So can we alternatively use this: `for(Node tn : elem.getAllElements()) {if(!(tn instanceof TextNode)) continue;//Some stuff }`? — Sudarshan Bhat, Dec 07 '11 at 04:43

score 0 · Answer 3 · answered Jun 11 '13 at 16:23

I am replacing the word hello with hello wrapped in a span tag:

Document doc = Jsoup.parse(content);
    Element test =  doc.body();
    Elements elemenets = test.getAllElements();
    for(int i =0 ;i <elemenets .size();i++){
        String elementText = elemenets .get(i).text();
        if(elementText.contains("hello"))
            elemenets .get(i).html(l.get(i).text().replaceAll("hello","<span style=\"color:blue\">hello</span>"));
    }

Ruby

What is the "l" supposed to be referring to? – Maude, Jan 07 '20 at 00:33
