
I am trying to extract text from a node in an XHTML page using XPath, but I am having trouble collecting ALL the text below a given node.

The problem is that a node (for example, the p element below) can have multiple child nodes (here, "b" and "em") with multiple text fragments interspersed ("aaaa", "bbbb" and "cccc"). My XPath expression "p/text()", however, returns only the first text fragment, "aaaa", whereas I need ALL text fragments directly underneath the p node, i.e. I want to obtain "aaaabbbbcccc" (but not "foo" and "bar"). How do I get XPath to collect ALL the text nodes and return them as one concatenated string?

...
<p>
  aaaa
  <b>foo</b>
  bbbb
  <em>bar</em>
  cccc
</p>
...

Alternatively: what would be the XPath expression to get a list of all text fragments, so that I can concatenate them programmatically in my code?

mmo
  • This really depends on the version of XPath and on the tool/environment/programming language you use. Please edit your post and include this information. – Mathias Müller Feb 23 '15 at 23:28
  • Thanks to both of you for responding! Glad to see that the problem is not my XPath expression - I was really scratching my head! Regarding the tools and environment: I am using JTidy r938 to parse the (X)HTML pages and generate the DOM, and Java 1.8's built-in XPath implementation (package javax.xml.xpath) to locate the nodes. Apparently the latter returns only the first text value if the return type is STRING, not all of them concatenated. If I request a NODESET, I do indeed get a list of all text nodes, which I then need to concatenate in my code. I had hoped that XPath could do that for me. – mmo Feb 24 '15 at 10:14
  • Not very familiar with Java, but you could look for text nodes while incrementing their position. Start selecting `//p/text()[1]`, then try `//p/text()[2]` and so on, until the result set is empty. (For your future questions, please include this information right away and tag the question with the programming language you use.) – Mathias Müller Feb 24 '15 at 10:17
  • Done. :-) And yes, adding the subscripts [1], [2] etc. to the expression does indeed return the individual text snippets. But there seems to be no syntax to get them all in one string. Thanks a lot! – mmo Feb 24 '15 at 10:19

2 Answers


Your XPath expression already selects all text nodes that are immediate children of a p element. It's just that your XPath engine or library only returns the first result.
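With Java's built-in javax.xml.xpath (the setup mentioned in the comments on the question), the difference probably comes from the requested return type: in XPath 1.0, converting a node-set to a string yields the string value of only the first node in document order. The following is a minimal sketch of that behaviour, not a definitive diagnosis. The class name `FirstNodeOnlyDemo` and its helper methods are made up for illustration, and the sample is parsed with the JDK's `DocumentBuilder` rather than JTidy.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name and helpers are illustrative only.
class FirstNodeOnlyDemo {
    // The sample from the question, as a well-formed XML string.
    static final String SAMPLE =
        "<p>\n  aaaa\n  <b>foo</b>\n  bbbb\n  <em>bar</em>\n  cccc\n</p>";

    // Assumption: the input is well-formed XML, so the JDK parser suffices.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates expr with return type STRING: only the first node survives.
    static String asString(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (String) xpath.evaluate(expr, parse(xml), XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluates expr with return type NODESET and counts the selected nodes.
    static int nodeCount(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                (NodeList) xpath.evaluate(expr, parse(xml), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // STRING keeps the first text node only (trimmed here for display).
        System.out.println(asString(SAMPLE, "//p/text()").trim()); // prints "aaaa"
        // NODESET exposes every text node that is a direct child of p.
        System.out.println(nodeCount(SAMPLE, "//p/text()"));       // prints 3
    }
}
```

So the same expression selects three text nodes; only the string conversion drops the second and third.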

To see that this is true, run the same XPath expression with a different engine, for instance on http://xpath.online-toolz.com/tools/xpath-editor.php. There, using

<p>
  aaaa
  <b>foo</b>
  bbbb
  <em>bar</em>
  cccc
</p>

as input and //p/text() as the path expression yields the following (individual results separated by a line of dashes):

[WHITESPACE-ONLY LINE]
aaaa
-----------------------
bbbb
-----------------------
cccc
[WHITESPACE-ONLY LINE]

If you don't mind the text inside the children of p also being output, you could use

string(//p)

which would yield

[WHITESPACE-ONLY LINE]
aaaa
foo
bbbb
bar
cccc
[WHITESPACE-ONLY LINE]

To get exactly the output you requested, you need to give more information (see the comment to your question).
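For the Java setup described in the comments on the question, one possible approach is to request a NODESET and join the text nodes in code. XPath 1.0, which javax.xml.xpath implements, has no function that joins an arbitrary number of nodes into one string; `string-join()` only arrived in XPath 2.0. The sketch below rests on a few assumptions: the class name `DirectTextCollector` and the method `directText` are hypothetical, the document is parsed with the JDK's `DocumentBuilder` instead of JTidy, and each fragment is trimmed before joining so that the indentation whitespace disappears.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative, not part of any library.
class DirectTextCollector {

    // Concatenates the trimmed text nodes that are direct children of the
    // element(s) selected by elementPath, e.g. "//p".
    static String directText(String xml, String elementPath) {
        try {
            // Assumption: well-formed XML input parsed by the JDK parser.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // NODESET returns every matching text node, not just the first.
            NodeList nodes = (NodeList) xpath.evaluate(
                elementPath + "/text()", doc, XPathConstants.NODESET);

            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Trimming drops the newlines and indentation around each fragment.
                joined.append(nodes.item(i).getNodeValue().trim());
            }
            return joined.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p>\n  aaaa\n  <b>foo</b>\n  bbbb\n  <em>bar</em>\n  cccc\n</p>";
        System.out.println(directText(xml, "//p")); // prints "aaaabbbbcccc"
    }
}
```

Trimming is a design choice: if the whitespace between fragments matters, append `getNodeValue()` unchanged and normalize the result afterwards.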

Mathias Müller

If I copy your sample XML into Notepad++ and use XPathenizer, the XPath expression /p/text() works fine.


This indicates that the XPath expression is fine and the fault lies elsewhere.

Andersnk
  • That's a very cool feature of Notepad++! Makes it worth considering it as my future text editor... – mmo Feb 25 '15 at 00:34