
I want to extract the elements of an HTML page that contain a given text, ignoring markup. For example, I want to extract the node containing the text "Run, Sarah, run!" from https://en.wiktionary.org/wiki/run. I know about the node test text() and the function string(), and I tried them both. [Screenshot: Firefox console, searching for "Run, Sarah, run!"]

As you can see, if I use string() it returns too many nodes (the result includes every node that contains the node I need), and if I use text() it returns nothing (because of the <b> tag).

How do I find the required nodes?

UPD: I want all the deepest nodes. That means that if the Wiktionary page contained this sentence twice, I would want to select both nodes.

Also, I don't know the node type.

rominf
  • Are you sure that you're using a web-scraping tool with an HTML parser that supports XPath 2.0? What is that tool? – Andersson Dec 23 '18 at 20:01
  • You are right! I messed up. Indeed, I use Splinter (based on Selenium) with Chrome webdriver. – rominf Dec 23 '18 at 20:17

1 Answer


`//*[contains(string(.), "Run, Sarah, run!")]` returns every element whose string value contains that text: the element you want plus all of its ancestors, up to the html node.

`//*[contains(text(), "Run, Sarah, run!")]` returns nothing because "Run, Sarah, run!" is split across several text nodes; no single text node contains the whole string.
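
To make the difference concrete, here is a minimal sketch in plain Python. The fragment is a hypothetical approximation of the Wiktionary markup (an assumption, not copied from the page), and `string_value` and `text_nodes` are illustrative helper names. It uses the standard library's `xml.etree.ElementTree` only to show how the sentence breaks into text nodes, not to run the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the example sentence (assumed markup).
FRAGMENT = "<dd><i><b>Run</b>, Sarah, <b>run</b>!</i></dd>"


def string_value(element):
    """Concatenate all descendant text, similar to XPath's string(.)."""
    return "".join(element.itertext())


def text_nodes(element):
    """Return the individual text nodes under the element, in document order."""
    return list(element.itertext())


root = ET.fromstring(FRAGMENT)
italic = root.find("i")

print(string_value(italic))  # Run, Sarah, run!
print(text_nodes(italic))    # ['Run', ', Sarah, ', 'run', '!']
```

The string value of `<i>` is the whole sentence, but none of the four text nodes contains it on its own, which is why a test on text() finds nothing.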

You can use the expression below to match the italic node with the required text:

'//i[normalize-space()="Run, Sarah, run!"]'

If you don't want to specify the node name, you can try:

'//*[normalize-space()="Run, Sarah, run!" and not(./*[normalize-space()="Run, Sarah, run!"])]'
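
The logic of that last expression (keep an element if its normalized text equals the target and none of its children match too) can be sketched in plain Python. This is an emulation under stated assumptions, not the XPath engine itself: the standard library's ElementTree does not support `normalize-space()`, so the helpers `normalized_text` and `deepest_matches` are hypothetical names that approximate it, and the page fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the sentence appears twice, in different elements.
PAGE = """<body>
  <div>
    <dl><dd><i><b>Run</b>, Sarah, <b>run</b>!</i></dd></dl>
    <p>  Run, Sarah,
       run!  </p>
    <p>Something else</p>
  </div>
</body>"""


def normalized_text(element):
    """Rough stand-in for normalize-space(.): string value, whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def deepest_matches(root, target):
    """Elements whose normalized text equals target while no child's does."""
    return [
        el
        for el in root.iter()
        if normalized_text(el) == target
        and not any(normalized_text(child) == target for child in el)
    ]


root = ET.fromstring(PAGE)
matches = deepest_matches(root, "Run, Sarah, run!")
print([el.tag for el in matches])  # ['i', 'p']
```

The `<dl>` and `<dd>` wrappers have the same text as the `<i>` they contain, so the child check filters them out, and both separate occurrences are returned.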
Andersson
  • Thank you for the explanation. I think the last variant is what I need. But why is it called `normalize-space` if it strips markup? Also, I didn't find anything about this behaviour in the docs: https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space – rominf Dec 23 '18 at 19:46
  • Also, there is some problem with the last solution: I want to get all deepest nodes. That means if the page contains several separate nodes containing the text, I want them all! – rominf Dec 23 '18 at 19:48
  • @rominf, I'm not sure I understand what your desired output is... Do you mean you want an array of nodes that contain the text? – Andersson Dec 23 '18 at 19:51
  • Yep, I selected this example for simplicity. See UPD. – rominf Dec 23 '18 at 19:52
  • Sorry, I have to undo accepting your answer. Probably it's my fault that I didn't write all the requirements in the first place. – rominf Dec 23 '18 at 19:57
  • Thank you. Not very beautiful, but it works and I can produce the XPath using string formatting and make it less ugly. – rominf Dec 23 '18 at 20:15
  • I don't like `normalize-space` (what a strange name!), so I've used the same logic with `contains(string(.))`. – rominf Dec 23 '18 at 20:19
  • I'm getting identical results. What's the difference in my case? – rominf Dec 23 '18 at 20:25
  • @rominf, I guess in your current case there is no difference. Actually, the XPath can be even shorter: `//*[.="Run, Sarah, run!" and not(./*[.="Run, Sarah, run!"])]` – Andersson Dec 23 '18 at 20:40
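
Since the comments mention producing the XPath with string formatting, here is one possible sketch of such a builder. The function names `xpath_literal` and `deepest_elements_xpath` are hypothetical. The quoting step is there because XPath 1.0 string literals have no escape sequences, so text containing both quote characters has to be assembled with `concat()`.

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if '"' not in text:
        return f'"{text}"'
    if "'" not in text:
        return f"'{text}'"
    # Both quote kinds present: split on double quotes and rejoin with concat().
    pieces = []
    for index, part in enumerate(text.split('"')):
        if index > 0:
            pieces.append("'\"'")  # a double quote wrapped in single quotes
        if part:
            pieces.append(f'"{part}"')
    return "concat(" + ", ".join(pieces) + ")"


def deepest_elements_xpath(text):
    """Build the 'deepest element with this exact text' expression from the answer."""
    literal = xpath_literal(text)
    return (
        f"//*[normalize-space()={literal} "
        f"and not(./*[normalize-space()={literal}])]"
    )


print(deepest_elements_xpath("Run, Sarah, run!"))
```

The resulting string can be passed to whatever XPath lookup your tool offers, for example Splinter's `find_by_xpath`.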