XPath: How to collect multiple text fragments from an XHTML node?

Question

I am trying to extract text from a node in an XHTML page using XPath, but I am having trouble collecting ALL the text below a given node.

The problem is that a node (for example, the p element below) can have multiple child nodes (here "b" and "em") with multiple text fragments interspersed ("aaaa", "bbbb" and "cccc"). My XPath expression "p/text()", however, returns only the first text, "aaaa", while I need to collect ALL text fragments directly underneath the p node, i.e. I want to obtain "aaaabbbbcccc" (but not "foo" and "bar"). How do I get XPath to collect ALL the texts and return them as one concatenated string?

...
<p>
  aaaa
  <b>foo</b>
  bbbb
  <em>bar</em>
  cccc
</p>
...

Alternatively: what would be the XPath expression to get a list of all text fragments, so that I can concatenate them programmatically in my code?

This really depends on the version of XPath and on the tool/environment/programming language you use. Please edit your post and include this information. — Mathias Müller, Feb 23 '15 at 23:28
Thanks both for responding! Glad to see that the problem is not my XPath expression - I was really scratching my head! Re. tools and environments used: I am using JTidy r938 to parse the (X)HTML pages and generate the DOM, and Java 1.8's built-in XPath implementation (package javax.xml.xpath) to locate the nodes. Apparently the latter returns only the first text value, not all of them concatenated, if the return type is STRING. If I request a NODESET, I do indeed get a list of all texts, which I then need to concatenate in my code. I had hoped that XPath could do that for me. — mmo, Feb 24 '15 at 10:14
Not very familiar with Java, but you could look for text nodes while incrementing their position. Start selecting `//p/text()[1]`, then try `//p/text()[2]` and so on, until the result set is empty. (For your future questions, please include this information right away and tag the question with the programming language you use.) — Mathias Müller, Feb 24 '15 at 10:17
Done. :-) And - yes - adding array subscripts [1], [2] etc. to the expression does indeed return the individual text snippets. But there seems to be no syntax to get them all in one string. Thanks a lot! — mmo, Feb 24 '15 at 10:19

score 2 · Answer 1 · answered Feb 23 '15 at 23:37

Your XPath expression already selects all text nodes that are immediate children of a p element. It's just that your XPath engine or library only returns the first result.

To see that this is true, run the same XPath expression with a different engine, for instance on http://xpath.online-toolz.com/tools/xpath-editor.php. There, using

<p>
  aaaa
  <b>foo</b>
  bbbb
  <em>bar</em>
  cccc
</p>

as input and //p/text() as the path expression yields (individual results separated by a line of dashes):

[WHITESPACE-ONLY LINE]
aaaa
-----------------------
bbbb
-----------------------
cccc
[WHITESPACE-ONLY LINE]

If you don't mind the text inside the children of p also being output, you could use

string(//p)

which would yield

[WHITESPACE-ONLY LINE]
aaaa
foo
bbbb
bar
cccc
[WHITESPACE-ONLY LINE]
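
The difference between the two expressions can also be reproduced in Java, the environment mentioned in the comments on the question. The following is only a minimal sketch, not the asker's actual code: it assumes the JDK's built-in DocumentBuilder as a stand-in for JTidy, and the class name StringValueDemo and the helper evalAsString are made up for illustration. Under XPath 1.0 rules, converting a node-set to a string takes the string-value of the first node in document order only, which would explain why evaluating //p/text() as a string gives just the first fragment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class StringValueDemo {

    // Hypothetical helper: parses the XML and evaluates the expression
    // with a STRING result, as XPath.evaluate(String, Object) does.
    static String evalAsString(String expr, String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same structure as the sample, written without line breaks
        // so that the results contain no surrounding whitespace.
        String xml = "<p>aaaa<b>foo</b>bbbb<em>bar</em>cccc</p>";

        // Node-set converted to a string: first text node only.
        System.out.println(evalAsString("//p/text()", xml));  // aaaa

        // String-value of the p element: all descendant text.
        System.out.println(evalAsString("string(//p)", xml)); // aaaafoobbbbbarcccc
    }
}
```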

To get exactly the output you requested, you need to give more information (see the comment on your question).
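
With the setup described in the comments (javax.xml.xpath on Java 8), one possible way to get the requested string is to ask for the text nodes as a NODESET and join them in code. This is a minimal sketch under stated assumptions: it uses the JDK's built-in DocumentBuilder in place of JTidy, the names DirectTextCollector and collectDirectText are hypothetical, and trimming each fragment is a guess at how the surrounding whitespace should be handled.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectTextCollector {

    // Hypothetical helper: concatenates the text nodes that are
    // direct children of p elements, skipping text inside b, em, etc.
    static String collectDirectText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // NODESET returns every matching text node, not just the first.
            NodeList texts = (NodeList) xpath.evaluate(
                    "//p/text()", doc, XPathConstants.NODESET);

            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                // Assumption: indentation and line breaks are not wanted.
                sb.append(texts.item(i).getNodeValue().trim());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>\n  aaaa\n  <b>foo</b>\n  bbbb\n  <em>bar</em>\n  cccc\n</p>";
        System.out.println(collectDirectText(xml)); // aaaabbbbcccc
    }
}
```

If the whitespace between fragments matters, the trim() call can be dropped or replaced with whatever normalization the application needs.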

score 1 · Answer 2 · answered Feb 23 '15 at 22:40 · edited Feb 23 '15 at 22:48 · Andersnk

If I copy your sample XML into Notepad++ and use XPathenizer, the XPath expression /p/text() works fine.

This indicates that the XPath expression is fine and the fault lies elsewhere.

That's a very cool feature of Notepad++! Makes it worth considering it as my future text editor... – mmo Feb 25 '15 at 00:34
