
<td>
  <span>hi</span>
  <a>re</a>
  hello
</td>
I have the DOM element structure shown above. Using HtmlUnit, I want to extract only the value "hello", given an HtmlElement object that refers to the "td" node. I tried getTextContent(), but it returns "hirehello", which I don't want.
roeygol
user3247895

1 Answer


According to the documentation, getTextContent returns the text of the element and its descendants, and I don't see another method that returns only the element's own text nodes, so I think you need a loop. For example, assuming element refers to the td element:

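// A local StringBuilder is enough; StringBuffer's synchronization isn't needed here
StringBuilder sb = new StringBuilder(/*some appropriate size*/);
for (DomNode n : element.getChildNodes()) {
    // Node is org.w3c.dom.Node; keep only the td's own text nodes
    if (n.getNodeType() == Node.TEXT_NODE) {
        sb.append(n.getTextContent());
    }
}
String text = sb.toString();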

Note that the sum of the text nodes in the structure you've quoted isn't just "hello": it also includes the whitespace before and after it. If you want just "hello", you'll need to trim that off (for example with text.trim()).
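To make the loop-and-trim idea concrete, here is a minimal runnable sketch. It is not HtmlUnit code: it parses the snippet with the JDK's built-in XML parser so it runs without extra dependencies, on the assumption that the same logic carries over because HtmlUnit's DomNode implements org.w3c.dom.Node. The class name DirectTextDemo and the helpers directText and parse are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DirectTextDemo {

    // Concatenates only the element's own text-node children
    // (skipping child elements such as span and a), then trims the result.
    static String directText(Element element) {
        StringBuilder sb = new StringBuilder();
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Hypothetical helper: parses a well-formed snippet with the JDK's
    // XML parser (an assumption standing in for an HtmlUnit page).
    static Element parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element td = parse("<td>\n  <span>hi</span>\n  <a>re</a>\n  hello\n</td>");
        System.out.println("[" + directText(td) + "]"); // prints [hello]
    }
}
```

Trimming after concatenation, rather than trimming each text node, keeps any interior spacing between separate text nodes intact; whether that matters depends on the markup you are scraping.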

T.J. Crowder
  • I guess `element.normalize()`, followed by `element.getLastChild().asText()` should do the trick, too. But I haven't tested it to make sure. – JB Nizet Nov 27 '16 at 10:19
  • @JBNizet: Probably, for that *specific* structure, since in this specific case it's just the one text node they're interested in. – T.J. Crowder Nov 27 '16 at 10:20