We need to retrieve all the text nodes that descend from an element, whether direct or indirect. The textNodes() method on the Element class returns only the direct child text nodes, not the ones nested deeper (grandchildren, great-grandchildren, and so on).

Given the following sample HTML file:

<html>
    <head/>
    <body>
        <div class="erece mtmhp">
            <a href="http://www.stackoverflow.com">
                <span>Content 1</span>
                <span>Content 2</span>
            </a>
        </div>
    </body>
</html>

I would like to retrieve Content 1 and Content 2 as two separate values.

Here is my sample code:

import java.io.File;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class TextNodeExpt {
    public static void main(String[] args) throws Exception {

        File fileObj = new File(args[0]);
        Document document = Jsoup.parse(fileObj, "UTF-8");

        Elements divs = document.select("div.erece.mtmhp");

        displaySubtendingTextNodes(divs);
    }

    protected static void displaySubtendingTextNodes(Elements divs) {
        for (Element div : divs) {
            Elements anchors = div.select("a");
            for (Element anchor : anchors) {
                List<TextNode> textNodes = anchor.textNodes();
                System.out.println(textNodes.size() + " text nodes found");
                for (TextNode tn : textNodes) {
                    System.out.println("[" + tn.getWholeText() + "]");
                }
            }
        }
    }
}

Addendum:
Based on the comment from Hovercraft Full Of Eels, I have come up with the following implementation. It is based on this Stack Overflow posting.

I use List<TextNode> in the signature, as the textNodes() method on the Element class does, except that the list is passed in as a parameter and filled rather than returned from the method.

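// Requires: import org.jsoup.nodes.Node;
// Recursively collects every non-blank text node below targetNode into nodeList.
protected static void textNodes(Node targetNode, List<TextNode> nodeList) {
    for (Node childNode : targetNode.childNodes()) {
        if (childNode instanceof TextNode) {
            TextNode textNode = (TextNode) childNode;
            // Skip whitespace-only nodes, such as the indentation between tags.
            if (!textNode.isBlank()) {
                nodeList.add(textNode);
            }
        } else {
            // Not a text node: descend into its children.
            textNodes(childNode, nodeList);
        }
    }
}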
Sandeep
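The depth-first collection used in the addendum can also be tried without jsoup on the classpath. The following is a minimal sketch that applies the same recursion to the JDK's built-in org.w3c.dom API rather than to jsoup's Node and TextNode classes. The class name DescendantTextDemo and the helpers collectText and textsUnder are hypothetical names chosen for illustration, and the input is assumed to be well-formed XML, which the sample markup above is.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class DescendantTextDemo {

    // Depth-first walk: add the trimmed value of every non-blank
    // text node below `target` to `out`, in document order.
    static void collectText(Node target, List<String> out) {
        NodeList children = target.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Text) {
                String value = child.getNodeValue().trim();
                if (!value.isEmpty()) {
                    out.add(value);
                }
            } else {
                // Not a text node: descend into its children.
                collectText(child, out);
            }
        }
    }

    // Parses `xml` (assumed well-formed) and collects the text below
    // the first element named `tagName`.
    static List<String> textsUnder(String xml, String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> out = new ArrayList<>();
        collectText(doc.getElementsByTagName(tagName).item(0), out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div><a href=\"http://www.stackoverflow.com\">\n"
                + "  <span>Content 1</span>\n"
                + "  <span>Content 2</span>\n"
                + "</a></div>";
        System.out.println(textsUnder(xml, "a")); // prints [Content 1, Content 2]
    }
}
```

Each span's text arrives as its own list entry, and the whitespace-only nodes between the tags are dropped, which is what the isBlank() check does in the jsoup version.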

1 Answer

It is not entirely clear what you mean by "retrieve separately". However, if you want to control how the text is extracted from the element, you can create a class that implements the NodeVisitor interface and put the extraction logic there. That makes it easy to handle "Content 1" and "Content 2" as separate strings later, or to collect them into a list.

// Additional imports: org.jsoup.nodes.Node, org.jsoup.select.NodeVisitor
public class TextNodeExpt {
    public static void main(String[] args) throws Exception {
        Document document = Jsoup.parse(new File(args[0]), "UTF-8");
        Element targetElement = document.select("div.erece").first();
        TextNodeVisitor textNodeVisitor = new TextNodeVisitor();
        // Node.traverse runs the visitor depth-first over the element and all of its descendants.
        targetElement.traverse(textNodeVisitor);
        String extractedText = textNodeVisitor.getExtractedText();
        System.out.println(extractedText);
    }

    static class TextNodeVisitor implements NodeVisitor {
        private final StringBuilder extractedText = new StringBuilder();

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                // Skip whitespace-only nodes, such as the indentation between tags.
                if (!text.isEmpty()) {
                    extractedText.append(text).append("\n");
                }
            }
        }

        @Override
        public void tail(Node node, int depth) {
            // Do nothing on tail
        }

        // All collected text, one text node per line.
        public String getExtractedText() {
            return extractedText.toString();
        }

        // The same text as a list, one entry per text node (List.of requires Java 9+).
        public List<String> getLineList() {
            if (extractedText.length() == 0) {
                return List.of(); // split() on an empty string would yield one empty entry
            }
            return List.of(extractedText.toString().split("\n"));
        }
    }
}
Sergei S.