I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPath to pull out all of the text on the page (but not the JavaScript found within <script> tags).
I figure I need a //text() and then a 'not' to skip any text node that sits inside a <script> element.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]")
and
doc.DocumentNode.SelectNodes("//text()[not(script)]")
but neither works. Both still return text nodes from inside script elements. An example of the XPath property of a node that they return is (notice the script):
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
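To make this easier to reproduce, here is a minimal sketch of what I am doing. The HTML string is a made-up stand-in for the real page (not the actual markup), and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical stand-in for the real page: one visible text node, one script.
        var html = "<html><body><div>Visible text</div>" +
                   "<div><script>var hidden = 1;</script></div></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The two expressions I tried.
        var attempts = new[] { "//text()[not(self::script)]", "//text()[not(script)]" };

        foreach (var xpath in attempts)
        {
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes(xpath);
            if (nodes == null) continue;

            foreach (var node in nodes)
            {
                // Print where each returned text node lives and what it contains.
                Console.WriteLine($"{xpath} -> {node.XPath}: {node.InnerText}");
            }
        }
    }
}
```

On my real page, both expressions still return the text node under the script element, which is the behaviour I am trying to avoid.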
I have consulted both of these posts:
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post, but its approach also returns the JS)
Any suggestions?