
Please note: This question is a more refined version of a previous question.

I am looking for an XPath that lets me find elements with a given plain text in an HTML document. For example, suppose I have the following HTML:

<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <yetAnotherElement>This can <em>not</em> be found</yetAnotherElement>
</body>
</html>

I need to search by text and am able to find <someElement> using the following XPath:

//*[contains(text(), 'This can be found')]

I am looking for a similar XPath that lets me find <someOtherElement> and <yetAnotherElement> using the plain text "This can not be found". The following does not work:

//*[contains(text(), 'This can not be found')]

I understand that this is because the nested em element "disrupts" the text flow of "This can not be found": text() only yields the text nodes that are direct children of the element, and these are split by the em element. Is it possible with XPath to ignore this kind of nesting?
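To make the disruption concrete, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree purely for illustration (my choice, not something required by the question), and the variable names are hypothetical. It shows how the text of <someOtherElement> is split around the em element, so that no single direct text node contains the full phrase. In XPath 1.0, contains(text(), ...) only looks at the first of those text nodes.

```python
import xml.etree.ElementTree as ET

# One of the elements from the example document above.
SNIPPET = "<someOtherElement>This can <em>not</em> be found most nested</someOtherElement>"

element = ET.fromstring(SNIPPET)

# Text directly inside the element, before the first child element.
first_text_node = element.text
# ElementTree stores the text after </em> as the "tail" of the <em> child.
text_after_em = element[0].tail
# Joining all descendant text corresponds to the XPath string-value.
string_value = "".join(element.itertext())

print(repr(first_text_node))
print(repr(text_after_em))
print(repr(string_value))
```

Neither fragment on its own contains "This can not be found"; only the joined text does.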

Michael Herrmann

1 Answer


You can use

//*[contains(., 'This can not be found')]
   [not(.//*[contains(., 'This can not be found')])]

This XPath consists of two parts:

  1. //*[contains(., 'This can not be found')]: Here . is the context node, and contains() converts it to its string-value, which for an element is the concatenation of all its descendant text nodes. This part therefore selects all elements whose string-value contains 'This can not be found'. In the above example, these are <someOtherElement> and <yetAnotherElement>, but also their ancestors <nested>, <body> and <html>.
  2. [not(.//*[contains(., 'This can not be found')])]: This removes elements that have a descendant element whose string-value still contains the plain text 'This can not be found'. It removes the unwanted elements <nested>, <body> and <html> in the above example.
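The logic of the two predicates can be sketched in Python as follows. This is an emulation under stated assumptions, not an XPath engine: it uses only the standard library's xml.etree.ElementTree, whose XPath subset does not support contains(), so both predicates are re-implemented by hand. The helper names string_value and innermost_matches are hypothetical, and the example document is parsed as XML rather than as HTML.

```python
import xml.etree.ElementTree as ET

# The example document from the question, parsed here as XML.
DOCUMENT = """
<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <yetAnotherElement>This can <em>not</em> be found</yetAnotherElement>
</body>
</html>
"""

NEEDLE = "This can not be found"


def string_value(element):
    """Concatenate all descendant text, like the XPath string-value of an element."""
    return "".join(element.itertext())


def innermost_matches(root, needle):
    """Emulate //*[contains(., needle)][not(.//*[contains(., needle)])]."""
    matches = []
    for element in root.iter():  # root and all descendants, in document order
        if needle not in string_value(element):
            continue  # first predicate fails
        # element.iter() yields the element itself first, so skip it.
        descendants = list(element.iter())[1:]
        if any(needle in string_value(d) for d in descendants):
            continue  # second predicate fails: a descendant also matches
        matches.append(element)
    return matches


root = ET.fromstring(DOCUMENT)
found_tags = [element.tag for element in innermost_matches(root, NEEDLE)]
print(found_tags)
```

For the example document, this prints ['someOtherElement', 'yetAnotherElement']: the ancestors match the first check but are dropped by the second.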

You can try these XPaths out here.

Michael Herrmann