
Different websites present XPath syntax in different forms; the main difference is whether a "/text()" suffix is needed.

Citing syntax without the need of suffix:

Citing syntax with the need of suffix:

As far as I am aware, each library also works only with the suffix or only without it (I have not yet encountered one that works both ways).

Requires no suffix:

Requires suffix:

  • Java JRE native XPath implementation

It seems most likely that there is a difference between XPath implementations meant for use with XML and those meant for use with the DOM. If so, what are the differences, and where can I find them documented?

    It's like "I see some people eating food with forks, and some with spoons. Which is correct?" It depends on whether you're eating soup or steak. If you need text nodes, you use `text()`. If you're not looking at text nodes, you don't. It has nothing to do with correctness, requirements or implementations; it's just about what the code needs. You can find out about the difference by learning about XML, DOM and XPath (not from snippets, but from actual XPath documentation, like [MDN's](https://developer.mozilla.org/en-US/docs/Web/XPath)). – Amadan Aug 15 '19 at 05:13

2 Answers


I think you have misdiagnosed the situation, and the reason for the misdiagnosis (to stretch an analogy much too far) is that you've looked at the symptoms of about 7 patients rather than going to medical school and learning about anatomy.

The "anatomy" here is the XDM data model which underpins the semantics of XPath. Note in particular that

(a) when you have a structure like this

<title>Water</title>

there is an element node, whose string value is "Water", and which is the parent of a single text node, whose string value is also "Water".

(b) when you have a structure like this

<title>H<sub>2</sub>O</title>

there is an element node, whose string value is "H2O", which is the parent of three children: a text node with string value "H", an element node with string value "2" (which itself is the parent of another text node...), and a second text node with string value "O".
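As a rough illustration of cases (a) and (b), the sketch below uses the JDK's built-in DOM parser. The DOM is not the XDM, but for these two small documents getTextContent() returns the same string that the data model calls the string value. The names TitleAnatomy, parseRoot and describeChildren are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleAnatomy {
    /** Parses an XML string and returns its root element (hypothetical helper). */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Lists each child of the element as "kind:string-value". */
    static List<String> describeChildren(Element element) {
        List<String> out = new ArrayList<>();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String kind;
            if (child.getNodeType() == Node.TEXT_NODE) {
                kind = "text";
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                kind = "element";
            } else {
                kind = "other";
            }
            out.add(kind + ":" + child.getTextContent());
        }
        return out;
    }

    public static void main(String[] args) {
        Element simple = parseRoot("<title>Water</title>");
        Element mixed = parseRoot("<title>H<sub>2</sub>O</title>");
        // prints: Water [text:Water]
        System.out.println(simple.getTextContent() + " " + describeChildren(simple));
        // prints: H2O [text:H, element:2, text:O]
        System.out.println(mixed.getTextContent() + " " + describeChildren(mixed));
    }
}
```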

In case (a) nearly all operations produce the same result whether applied to the element node or the text node. For example contains($x, "ate") will be true whether $x is the element node or the text node. So adding /text() to the path is generally redundant: it does no harm, but it's unnecessary. We often advise against doing it, because it makes your code more fragile if the structure of the data later changes, quite apart from just adding unnecessary verbosity.

In case (b) adding /text() to your path causes you to select the two text nodes "H" and "O" instead of selecting the element node. In XPath 1.0, many operations (such as contains()) when applied to a sequence of two text nodes ignore all but the first, so contains(x/y/title/text(), "O") will return false; in XPath 2.0 it will throw an error saying that the argument to contains() must be a singleton. If you simply want to know whether the title contains the letter "O", then it's much better to leave out the /text() and apply the operation to the string value of the element, which is the concatenation of all the text nodes.
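The behaviour in both cases can be checked with the XPath 1.0 engine that ships with the JDK. This is only a sketch: the wrapper documents with x and y elements are invented to match the x/y/title path used above, and ContainsDemo and holds are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class ContainsDemo {
    /** Evaluates an XPath 1.0 expression against an XML string and returns it as a boolean. */
    static boolean holds(String xml, String expression) {
        try {
            return (Boolean) XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample documents, one per case.
        String water = "<x><y><title>Water</title></y></x>";
        String h2o = "<x><y><title>H<sub>2</sub>O</title></y></x>";

        // Case (a): element and text node behave the same.
        System.out.println(holds(water, "contains(x/y/title, 'ate')"));        // true
        System.out.println(holds(water, "contains(x/y/title/text(), 'ate')")); // true

        // Case (b): with /text(), XPath 1.0 only looks at the first text node, "H".
        System.out.println(holds(h2o, "contains(x/y/title, 'O')"));            // true
        System.out.println(holds(h2o, "contains(x/y/title/text(), 'O')"));     // false
    }
}
```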

The only time you need to use "/text()" is if you want to probe more deeply into the internal structure of the title element.

It is of course possible that there are differences between XPath implementations - not all of them have 100% conformance to the standard. But the mainstream implementations are pretty compatible, and if you find a difference, please tell us about it: be explicit about the source document, the path expression, and the different results obtained in different implementations.

Michael Kay

If you look at the relevant specifications, you will find that both the XPath 1.0 specification (https://www.w3.org/TR/xpath-10/#node-tests) and the XPath 2.0 specification (https://www.w3.org/TR/xpath20/#node-tests) define what you call a "suffix" as a "node test", text(), which selects any "text node".

Neither specification makes the use of text() a requirement, but it is of course an option the language offers, and needs, in order to select text nodes, for instance in mixed content of elements, text and/or comments where you have a reason to select only the text node children.

As for implementations, I don't think Java's XPath 1.0 implementation requires you to use it. The only reason some older DOM-specific code uses foo/text() instead of simply foo to read out the string contents of an element such as <foo>some example</foo> is that, with older DOM implementations, a selected Element node has no property or method for accessing its text contents as a string. People therefore used foo/text() to select the Text child node of the Element and then used the nodeValue property (JavaScript) or the getNodeValue() method (Java) to get the string "some example". However, the DOM has for years provided a textContent property on Element nodes, so these days you can use foo as the XPath, get an Element node, and read out textContent or getTextContent() respectively to obtain the string "some example".
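The two styles can be put side by side in Java with the JDK's own DOM and XPath classes. This is a minimal sketch, not code taken from any particular project; TextAccessDemo and select are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextAccessDemo {
    /** Returns the first node selected by the XPath expression, or null if none. */
    static Node select(String xml, String expression) {
        try {
            return (Node) XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<foo>some example</foo>";

        // Older style: select the Text child and read its nodeValue.
        Node text = select(xml, "foo/text()");
        System.out.println(text.getNodeValue());      // some example

        // An Element has no nodeValue, which is why the older style existed.
        Node element = select(xml, "foo");
        System.out.println(element.getNodeValue());   // null

        // DOM Level 3 style: select the Element and read its textContent.
        System.out.println(element.getTextContent()); // some example
    }
}
```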

The MSXML DOM and XPath implementation is also rather old and has never been updated to the DOM Level 3 W3C specification, but Microsoft from the beginning had its own proprietary .text property on element nodes, which you can use there instead of the standardized textContent. Nevertheless, in that context I have seen similar attempts to explicitly read out foo/text() as a node list on which you can then access the nodeValue of each text node as a string.

The only implementation-specific "preference" for foo/text() over foo that I have seen is in Python's lxml library, if you want a direct mapping of the XPath selection to a list of Python strings. There, an expression like foo/text() in the context of e.g. <data><foo>a</foo><foo>b</foo></data> gives you a list of two Python strings, a and b, on the Python side, while foo gives you a list of two element nodes. So, depending on your needs on the host-language side, it can be easier to use foo/text(), but you need to be aware that an input like <data><foo>a<!-- comment -->b</foo><foo>c</foo></data> will give you a list of three strings.
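The three-strings effect comes from XPath itself rather than from lxml, so it can be reproduced with the JDK's XPath engine as well. The sketch below is Java, not lxml, and uses absolute paths from the document root instead of a data context node; TextNodeListDemo and strings are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeListDemo {
    /** Returns the text content of every node selected by the XPath expression. */
    static List<String> strings(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<data><foo>a<!-- comment -->b</foo><foo>c</foo></data>";

        // The comment splits the first foo's text into two text nodes.
        System.out.println(strings(xml, "/data/foo/text()")); // [a, b, c]

        // Selecting the elements gives one string value per foo.
        System.out.println(strings(xml, "/data/foo"));        // [ab, c]
    }
}
```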

Martin Honnen