Hope someone out there can quickly point me in the right direction with my XPath difficulties.

Currently I've got to the point where I'm identifying the correct table I need in my HTML source, but then I need to process only the rows that have the text 'Chapter' somewhere in the DOM.

My last attempt was to do this :

// get the correct table
HtmlTable table = page.getFirstByXPath("//table[2]");

// now the failing bit....
def rows = table.getByXPath("*/td[contains(text(),'Chapter')]") 

I thought the XPath above would mean: get me all elements that have a child 'td' element which somewhere in its DOM contains the text 'Chapter'.

An example of a matching row from my source is :

<tr valign="top">
  <td nowrap="" align="Right">
   <font face="Verdana">
   <a href="index.cfm?a=1">Chapter 1</a>
   </font>
  </td>
  <td class="ChapterT">
    <font face="Verdana">DEFINITIONS</font>
  </td>
  <td>&nbsp;</td>
</tr>

Any help / pointers greatly appreciated.

Thanks,

Kirill Polishchuk
David Brown

3 Answers


Use this XPath:

//td[contains(., 'Chapter')]
Kirill Polishchuk
  • Thanks, that appears to work. What does the '.' represent? Also, I don't understand why the 'relative' detection isn't working, e.g. you have the // which, as I understand it, means begin at the root? – David Brown Mar 10 '12 at 13:17
  • @Dave, You're welcome. `.` and `//` are XPath abbreviated syntax. `.` selects the context node. `//td` selects all the `td` descendants of the document root and thus selects all `td` elements in the same document as the context node. *Reference*: http://www.w3.org/TR/xpath/#path-abbrev – Kirill Polishchuk Mar 10 '12 at 16:24
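To make the difference between `.`, `text()`, `//` and `.//` concrete, here is a minimal sketch using only the JDK's built-in XPath 1.0 engine (javax.xml.xpath) rather than HtmlUnit. The two-table document and the names `ContextNodeDemo` and `countFromSecondTable` are made up for illustration; the second table mirrors the row from the question, and every expression is evaluated with that table as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContextNodeDemo {
    // Hypothetical document: two tables, the second mirrors the question's row.
    static final String XML =
        "<html><body>"
      + "<table><tr><td><font><a href='x'>Chapter 0</a></font></td></tr></table>"
      + "<table><tr valign='top'>"
      + "<td align='Right'><font face='Verdana'>"
      + "<a href='index.cfm?a=1'>Chapter 1</a></font></td>"
      + "<td class='ChapterT'><font face='Verdana'>DEFINITIONS</font></td>"
      + "<td>&#160;</td>"
      + "</tr></table>"
      + "</body></html>";

    // Evaluates expr with the second table as context node; returns the match count.
    static int countFromSecondTable(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node table = (Node) xpath.evaluate("//table[2]", doc, XPathConstants.NODE);
            NodeList hits = (NodeList) xpath.evaluate(expr, table, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] exprs = {
            "*/td[contains(text(),'Chapter')]", // 0: text() only sees td's own text-node children
            "*/td[contains(., 'Chapter')]",     // 1: '.' is the td's full string value
            "//td[contains(., 'Chapter')]",     // 2: '//' restarts at the document root
            ".//td[contains(., 'Chapter')]"     // 1: './/' stays below the context node
        };
        for (String e : exprs) {
            System.out.println(countFromSecondTable(e) + "  " + e);
        }
    }
}
```

The first expression finds nothing because the word 'Chapter' sits inside the nested font and a elements, not in a text node that is a direct child of the td. The third one finds the cell in the first table as well, even though the context node is the second table.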

You want all the td elements under your current node, not all of those in the document, which is what the currently accepted answer selects.

Use:

.//td[.//text()[contains(., 'Chapter')]]

This selects all td descendants of the current node that have at least one text node descendant whose string value contains the string "Chapter".

If it is known in advance that any td under this table only has a single text node, this can be simplified to just:

.//td[contains(., 'Chapter')]
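
The two expressions above only differ when the word is split across several text nodes. A minimal sketch of that case, again using the JDK's javax.xml.xpath engine instead of HtmlUnit; the three-row table and the names `TextNodeDemo` and `count` are invented for illustration, and the table element itself is used as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeDemo {
    // Hypothetical table: in row 2 the word "Chapter" is split by a <b> element.
    static final String XML =
        "<table>"
      + "<tr><td><font><a href='index.cfm?a=1'>Chapter 1</a></font></td><td>DEFINITIONS</td></tr>"
      + "<tr><td>Chap<b>ter</b> 2</td><td>SCOPE</td></tr>"
      + "<tr><td>Appendix</td><td>NOTES</td></tr>"
      + "</table>";

    // Evaluates expr relative to the <table> element and returns the match count.
    static int count(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    expr, doc.getDocumentElement(), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 1: only row 1 has a single text node that contains "Chapter"
        System.out.println(count(".//td[.//text()[contains(., 'Chapter')]]"));
        // 2: the string value of row 2's first td is "Chapter 2", so it matches too
        System.out.println(count(".//td[contains(., 'Chapter')]"));
        // 2: the same test lifted to the rows, which is what the question asks for
        System.out.println(count(".//tr[td[contains(., 'Chapter')]]"));
    }
}
```

With HtmlUnit the same strings would presumably be passed to table.getByXPath(...) as in the question; that part is not exercised here.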
Dimitre Novatchev

You're on the right "path".
The contains() function is limited to a specific element, not text in any of the children. Try this XPath, which you could read as follows: get every tr/td with any sub-element that contains the text 'Chapter'.

tr/td[contains(*,"Chapter")]

Good luck
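
One caveat worth checking before relying on contains(*, ...): in XPath 1.0 a node-set passed where a string is expected is converted using only its first node in document order, so the test looks at the first child element rather than at any child. A minimal sketch with the JDK's javax.xml.xpath engine; the two-row table and the names `FirstChildDemo` and `count` are made up for illustration, and the expressions are evaluated relative to the table element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FirstChildDemo {
    // Hypothetical table: in row 2 the link is the SECOND child element of the td.
    static final String XML =
        "<table>"
      + "<tr><td><font><a>Chapter 1</a></font></td></tr>"
      + "<tr><td><b>Note</b><a>Chapter 2</a></td></tr>"
      + "</table>";

    // Evaluates expr relative to the <table> element and returns the match count.
    static int count(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    expr, doc.getDocumentElement(), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 1: only the first child element of each td is tested (font, then b)
        System.out.println(count("tr/td[contains(*, 'Chapter')]"));
        // 2: the predicate is applied to every child element in turn
        System.out.println(count("tr/td[*[contains(., 'Chapter')]]"));
    }
}
```

The path tr/td also only matches when the tr elements are direct children of the context node, so the result depends on how the page was parsed.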

William Walseth
  • Hi William, I gave it a go but couldn't get it to return anything. What has worked, although it doesn't seem the most efficient, is a one-liner: ' def chapterAnchors = page.anchors.findAll {HtmlAnchor a -> a.asText().contains('Chapter')} ' – David Brown Mar 10 '12 at 04:29