XPath "Not". Ignore branches with a specific tag

Question

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPath to pull out all of the text on the page (but not the JavaScript found within <script> tags).

I figure I need a //text() and then a 'not' to ignore any tag within the branch that has a <script> in it.

I have tried

doc.DocumentNode.SelectNodes("//text()[not(self::script)]")

and

doc.DocumentNode.SelectNodes("//text()[not(script)]")

but neither works. Here is an example of the XPath property of a node that they return (notice the script element in the path):

/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]

I have consulted both of these posts:

Is it possible to do 'not' matching in XPath?

Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)

Any suggestions?

score 4 · Accepted Answer · answered Feb 28 '12 at 13:49


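Your first attempt rejects all text nodes that are script elements, and your second rejects all text nodes that have script elements as children. Of course, in both cases, the condition is never true: a text node is never an element and never has children, so the not(...) predicate always holds and nothing is filtered out.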

You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be

//text()[not(parent::script)]

or

//*[not(self::script)]/text()
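
For illustration, here is a minimal sketch of how the second expression might be used with the HTML Agility Pack. The class name VisibleText, the helper Extract, and the sample markup are all made up for this example and are not from the original question; the sketch also assumes that SelectNodes may return null when nothing matches, so it guards against that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class VisibleText
{
    // Hypothetical helper (not from the original post): collects the trimmed,
    // non-empty text of every text node whose parent element is not <script>.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // Select the text-node children of every element except <script>.
        var nodes = doc.DocumentNode.SelectNodes("//*[not(self::script)]/text()");

        // Assumption: SelectNodes may return null rather than an empty collection.
        if (nodes == null) return result;

        foreach (var node in nodes)
        {
            var text = node.InnerText.Trim();
            if (text.Length > 0) result.Add(text); // skip whitespace-only nodes
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample page for illustration only.
        const string html =
            "<html><body><p>Hello</p><script>var x = 1;</script><div>World</div></body></html>";

        foreach (var text in Extract(html))
            Console.WriteLine(text);
    }
}
```

Filtering on the parent element rather than on the text node itself is the point of the sketch: the predicate is evaluated against elements, which is where the name script can actually match.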


Michael Kay


This worked: //*[not(self::script)]/text() (the other did not, for some reason). Thanks! – DJA Feb 28 '12 at 21:49
