Select all deepest nodes with XPath 1.0 containing text, ignoring markup

Question

I want to extract elements from an HTML page by the text they contain, ignoring markup. For example, I want to extract the node containing the text "Run, Sarah, run!" from https://en.wiktionary.org/wiki/run. I know about the node test text() and the function string(). I tried them both:

As you can see, string() returns too many nodes (the result includes every ancestor of the node I need), and text() returns nothing (because of the <b> tag).

How do I find the required nodes?

UPD: I want all deepest nodes. That means if the Wiktionary page contained this sentence twice, I would want to select both nodes.

Also, I don't know the node type.

Are you sure that you're using a web-scraping tool with an HTML parser that supports XPath 2.0? What is that tool? – Andersson, Dec 23 '18 at 20:01
You are right! I messed up. Indeed, I use Splinter (based on Selenium) with the Chrome webdriver. – rominf, Dec 23 '18 at 20:17

Andersson · Accepted Answer · 2018-12-23T20:04:57.480

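//*[contains(string(.), "Run, Sarah, run!")] returns every element, from the html root down to the innermost match, whose string value contains that text. An element's string value is the concatenation of all its descendant text nodes, so every ancestor of the target matches too.

//*[contains(text(), "Run, Sarah, run!")] returns nothing because "Run, Sarah, run!" is spread over several text nodes (the <b> tag splits it) rather than sitting in a single text node. In XPath 1.0, contains() also converts the text() node-set to the string value of its first node only.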
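To make the difference between the two expressions concrete, here is a minimal Python sketch using the standard-library ElementTree as a stand-in for the browser's DOM. The markup in `snippet` is an assumed simplification of the Wiktionary page, not a copy of it, and ElementTree's `text`/`tail` attributes are used only to mimic what text() and string(.) see.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup for the example sentence (not the real page source).
snippet = "<dd><i><b>Run</b>, Sarah, run!</i></dd>"
root = ET.fromstring(snippet)
i = root.find("i")

# What string(.) sees: all descendant text nodes joined together.
string_value = "".join(i.itertext())

# What text() selects on <i>: only the text nodes that are direct children.
# In ElementTree these are i.text plus the tail of each child element.
own_text_nodes = [t for t in [i.text] + [child.tail for child in i] if t]

print(repr(string_value))
print(own_text_nodes)
```

The only direct text node of `<i>` is the part after `</b>`, so a test against text() never sees the full sentence, while the string value does.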

You can use the following to match the italic (<i>) element with the required text:

'//i[normalize-space()="Run, Sarah, run!"]'

If you don't want to specify the element name, you can try:

'//*[normalize-space()="Run, Sarah, run!" and not(./*[normalize-space()="Run, Sarah, run!"])]'
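The predicate above reads as "the element's whitespace-normalized string value equals the text, and no child element has that same value". With no argument, normalize-space() works on the string value of the context node, which is why markup inside the element is ignored. The sketch below emulates that logic in plain Python so the selection rule can be checked without a browser. The helper names (`normalize_space`, `deepest_matches`) and the sample document are hypothetical, and ElementTree parses XML, so this only illustrates the rule rather than reproducing what Chrome does on real HTML.

```python
import re
import xml.etree.ElementTree as ET

def normalize_space(el):
    """Mimic XPath normalize-space(.) on an element's string value."""
    text = "".join(el.itertext())
    # XPath treats space, tab, CR and LF as whitespace.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")

def deepest_matches(root, target):
    """Elements whose normalized text equals target and whose children do not."""
    return [
        el for el in root.iter()
        if normalize_space(el) == target
        and not any(normalize_space(child) == target for child in el)
    ]

# Hypothetical document containing the sentence twice, with different wrappers.
doc = (
    "<div>"
    "<p><i><b>Run</b>, Sarah, run!</i></p>"
    "<ul><li> <i><b>Run</b>, Sarah,\n  run!</i> </li></ul>"
    "</div>"
)
root = ET.fromstring(doc)
matches = deepest_matches(root, "Run, Sarah, run!")
print([el.tag for el in matches])
```

Both `<i>` elements are selected, while their `<p>`, `<li>` and `<ul>` wrappers are rejected because each has a child with the same normalized text.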

edited Dec 23 '18 at 20:04

answered Dec 23 '18 at 19:39

Thank you for the explanation. I think that the last variant is what I need. But why is it called `normalize-space` if it strips markup?) Also, I didn't find anything about this behaviour in the docs: https://developer.mozilla.org/en-US/docs/Web/XPath/Functions/normalize-space – rominf Dec 23 '18 at 19:46
Also, there is a problem with the last solution: I want to get all deepest nodes. That means if the page contains several separate nodes containing the text, I want them all! – rominf Dec 23 '18 at 19:48
@rominf, I'm not sure I understand what your desired output is... Do you mean you want an array of nodes that contain the text? – Andersson Dec 23 '18 at 19:51
Yep, I selected this example for simplicity. See UPD. – rominf Dec 23 '18 at 19:52
Sorry, I have to undo accepting your answer. Probably it's my fault that I didn't write all the requirements in the first place. – rominf Dec 23 '18 at 19:57
Thank you. Not very beautiful, but it works and I can produce the XPath using string formatting and make it less ugly. – rominf Dec 23 '18 at 20:15
I don't like `normalize-space` (what a strange name!), I've used the same logic with `contains(string(.))`. – rominf Dec 23 '18 at 20:19
I'm getting identical results. What's the difference in my problem? – rominf Dec 23 '18 at 20:25
@rominf, I guess in your current case there is no difference. Actually, the XPath can be even shorter: `//*[.="Run, Sarah, run!" and not(./*[.="Run, Sarah, run!"])]` – Andersson Dec 23 '18 at 20:40
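
The comments mention producing the XPath with string formatting. A minimal sketch of that idea follows. The function names `xpath_literal` and `deepest_nodes_xpath` are hypothetical. XPath 1.0 has no escape character inside string literals, so text containing both quote types is assembled with concat().

```python
def xpath_literal(text):
    """Quote text as an XPath 1.0 string literal."""
    if '"' not in text:
        return '"' + text + '"'
    if "'" not in text:
        return "'" + text + "'"
    # Both quote types present: split on double quotes and glue with concat().
    parts = text.split('"')
    return "concat(" + ", '\"', ".join('"' + p + '"' for p in parts) + ")"

def deepest_nodes_xpath(text):
    """Build the 'deepest element with this normalized text' expression."""
    lit = xpath_literal(text)
    return "//*[normalize-space()={0} and not(./*[normalize-space()={0}])]".format(lit)

print(deepest_nodes_xpath("Run, Sarah, run!"))
```

The resulting string could then be passed to the driver, for example Splinter's `browser.find_by_xpath(...)`, assuming a browser session is already open.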
