We need to retrieve all the text nodes beneath an element, whether they are direct children or deeper descendants. The textNodes() method on the Element class returns only the direct child text nodes, not the text nodes of grandchildren, great-grandchildren, and so on.
Given the following sample HTML file:
<html>
  <head/>
  <body>
    <div class="erece mtmhp">
      <a href="http://www.stackoverflow.com">
        <span>Content 1</span>
        <span>Content 2</span>
      </a>
    </div>
  </body>
</html>
I would like to be able to retrieve Content 1 and Content 2, but separately.
Here is my sample code:
import java.io.File;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class TextNodeExpt {

    public static void main(String[] args) throws Exception {
        File fileObj = new File(args[0]);
        Document document = Jsoup.parse(fileObj, "UTF-8");
        Elements divs = document.select("div.erece.mtmhp");
        displaySubtendingTextNodes(divs);
    }

    protected static void displaySubtendingTextNodes(Elements divs) {
        for (Element div : divs) {
            Elements anchors = div.select("a");
            for (Element anchor : anchors) {
                // textNodes() only looks at the direct children of the anchor
                List<TextNode> textNodes = anchor.textNodes();
                System.out.println(textNodes.size() + " text nodes found");
                for (TextNode tn : textNodes) {
                    System.out.println("[" + tn.getWholeText() + "]");
                }
            }
        }
    }
}
Addendum:
Based on the comment from Hovercraft Full Of Eels, I have come up with the following implementation. It is based on this Stack Overflow posting.
I have used List<TextNode> in the API, similar to the textNodes() method on the Element class, although the list is passed in as a parameter rather than returned from the method.
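// Requires: import org.jsoup.nodes.Node;
// Recursively collects every non-blank text node below targetNode into nodeList.
protected static void textNodes(Node targetNode, List<TextNode> nodeList) {
    for (Node childNode : targetNode.childNodes()) {
        if (childNode instanceof TextNode && !((TextNode) childNode).isBlank()) {
            nodeList.add((TextNode) childNode);
        } else {
            // Elements are walked recursively; blank text nodes also land here
            // but have no children, so nothing is added for them.
            textNodes(childNode, nodeList);
        }
    }
}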
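For reference, here is a minimal sketch of how the recursive helper could be wrapped so that it returns the list, and how it might be called on the sample markup. The class name AllTextNodesSketch and the wrapper allTextNodes are made up for illustration. It assumes jsoup is on the classpath and inlines the sample HTML as a string instead of reading it from a file.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class AllTextNodesSketch {

    // Hypothetical wrapper: returns the list instead of filling a parameter.
    static List<TextNode> allTextNodes(Node root) {
        List<TextNode> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    // Same recursive walk as the helper above: keep non-blank text nodes,
    // descend into everything that is not a text node.
    private static void collect(Node node, List<TextNode> out) {
        for (Node child : node.childNodes()) {
            if (child instanceof TextNode) {
                TextNode text = (TextNode) child;
                if (!text.isBlank()) {
                    out.add(text);
                }
            } else {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // Sample markup from the question, inlined for the sketch.
        String html = "<html><head></head><body>"
                + "<div class=\"erece mtmhp\">"
                + "<a href=\"http://www.stackoverflow.com\">\n"
                + "<span>Content 1</span>\n"
                + "<span>Content 2</span>\n"
                + "</a></div></body></html>";
        Document document = Jsoup.parse(html);
        Element anchor = document.select("div.erece.mtmhp a").first();
        for (TextNode tn : allTextNodes(anchor)) {
            System.out.println("[" + tn.getWholeText() + "]");
        }
    }
}
```

With the sample markup this prints [Content 1] and [Content 2] on separate lines, whereas anchor.textNodes() only looks at the direct children of the a element.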