
I am trying to get the text from a node, but the text from its child nodes is being appended. I want to avoid that.

I transformed HTML to XML using HTMLCleaner, and I have something similar to this:

<td>
    <a>Link Text</a>
    Column Text
</td>

I want to extract only Column Text, skipping any text inside the children of the selected td. Is there any way to do that? The XPath I have used so far is this:

//td/text()
Rodrigo Sasaki
  • What Xpath engine are you using? – BillRobertson42 Oct 01 '13 at 14:58
  • The one bundled inside `HTMLCleaner`. I am not really sure which one it is, but I imagine it's the one that comes with the JDK – Rodrigo Sasaki Oct 01 '13 at 15:02
  • HTMLCleaner provides only a "[partial implementation](http://htmlcleaner.sourceforge.net/release.php)" of XPath 1.0. It's not clear why implementers didn't use an existing solid XPath library. As it is, there's an unfortunate initial fog around whether an issue is due to user XPath mistakes or HTMLCleaner XPath implementation shortcomings. – kjhughes Oct 01 '13 at 19:31

1 Answer


This XPath:

//td[a = 'Link Text']/text()[last()]

Will select "Column Text".
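As a quick check, here is a minimal sketch that runs the expression through the JDK's own `javax.xml.xpath` engine (XPath 1.0) rather than HTMLCleaner's partial implementation, so HTMLCleaner itself may behave differently. The class name `LastTextNodeDemo` and the helper `evalTrimmed` are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of HTMLCleaner.
final class LastTextNodeDemo {
    // The sample markup from the question.
    static final String XML = "<td>\n    <a>Link Text</a>\n    Column Text\n</td>";

    /** Evaluates an XPath 1.0 expression as a string and trims the surrounding whitespace. */
    static String evalTrimmed(String expression, String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // The two-argument evaluate() parses the source and returns the result as a String.
            return xpath.evaluate(expression, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalTrimmed("//td[a = 'Link Text']/text()[last()]", XML));
    }
}
```

The trim matters because the selected text node still carries the newlines and indentation that surround "Column Text" in the markup.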

Be aware that if there are multiple td's with a's whose text equals "Link Text", under XPath 1.0 you'll get the last text node of the first such td (when the result is evaluated as a string); under XPath 2.0 you'll get the last text nodes of all such td's.
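To see the multiple-td behaviour concretely, the sketch below evaluates the same expression with the JDK's XPath 1.0 engine, once as a string and once as a node-set. The two-cell markup and the names `MultipleTdDemo`, `asString`, and `asNodeSet` are assumptions made for the example, and HTMLCleaner's bundled evaluator may not match this.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the markup below is invented for illustration.
final class MultipleTdDemo {
    static final String EXPR = "//td[a = 'Link Text']/text()[last()]";
    static final String XML =
        "<tr><td><a>Link Text</a>first</td><td><a>Link Text</a>second</td></tr>";

    /** String evaluation: XPath 1.0 uses the first matching node in document order. */
    static String asString(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(EXPR, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Node-set evaluation: returns the value of every matching text node. */
    static List<String> asNodeSet(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                EXPR, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(asString(XML));  // prints "first"
        System.out.println(asNodeSet(XML)); // prints "[first, second]"
    }
}
```

So with this engine, requesting a node-set is one way to reach every matching td even under XPath 1.0.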

Note that this would not pick up "prior text" in this example:

<td>
  prior text
  <a>Link Text</a>
  Column Text
</td>

If you want both "Column Text" and "prior text", but not "Link Text", and if you can use XPath 2.0, use this:

string-join(/td/text(), '')

(Be sure to also select the right td; I'm assuming only one here to simplify.)

For XPath 1.0, you'd have to assemble the text nodes outside of XPath.
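One way to do that assembly in Java is sketched below, assuming the JDK's `javax.xml.xpath` engine and a document whose root is the single td of interest. It selects the direct text children as a node-set and concatenates them in code. The names `JoinTextNodesDemo`, `joinDirectText`, and `normalize` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of HTMLCleaner.
final class JoinTextNodesDemo {
    // The second sample from the answer, with text on both sides of the link.
    static final String XML =
        "<td>\n  prior text\n  <a>Link Text</a>\n  Column Text\n</td>";

    /** Concatenates the direct text children of the root td, skipping text inside child elements. */
    static String joinDirectText(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "/td/text()", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                joined.append(nodes.item(i).getNodeValue());
            }
            return joined.toString();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    /** Collapses runs of whitespace so the indentation from the markup does not leak through. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(normalize(joinDirectText(XML)));
    }
}
```

The whitespace normalization is a choice made for this sketch; drop it if the original spacing needs to be preserved.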

See also "XPath to return string concatenation of qualifying child node values".

Community
kjhughes