How do I formulate this XPath expression?

Question

Given the following div element:

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>

I want to retrieve the contents of the span with class "b". However, some of the divs I want to parse lack the last two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?

If that is not possible, is it possible to create a selector that retrieves the entire contents of the div, i.e. one that retrieves

<a href="/s/xyz.html" class="title">title</a>
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>

If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)

Dimitre Novatchev · Answer 1 · 2012-07-12T16:01:12.090

A more efficient expression that requires no union:

   //div/span
          [@class='b'
           or
             @class='a'
            and
             not(parent::*[span[@class='b']])
           ]

An expression that is the union of two absolute "//" expressions (like the one below) typically performs two complete traversals of the document tree, after which the union operation deduplicates the results and sorts them in document order. All of this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.

An example of such an inefficient expression:

//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "//div/span
          [@class='b'
           or
             @class='a'
            and
             not(parent::*[span[@class='b']])
           ]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>

The XPath expression is evaluated and the selected elements (in this case just one) are copied to the output:

<span class="b">456</span>

When the same transformation is applied to a different XML document, one that has no span with class='b':

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="x">456</span>
    <span class="c">789</span>
</div>

the same XPath expression is evaluated and the correctly selected element is copied to the output:

<span class="a">123</span>
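For readers without an XSLT processor at hand, the same fallback rule can be sketched in Python. This is only an illustration under some assumptions: it uses the standard-library ElementTree module, which implements just a subset of XPath and, as far as I know, cannot evaluate `or`, `and` or `not()`, so the rule is reproduced in Python code rather than by running the expression above. The function name `pick_span` and the second sample document are made up for this sketch.

```python
import xml.etree.ElementTree as ET


def pick_span(div):
    """Return the class-'b' span of a div, or its class-'a' span if there is no 'b'."""
    # find() returns the first matching direct child, or None if there is none.
    span_b = div.find("span[@class='b']")
    if span_b is not None:
        return span_b
    return div.find("span[@class='a']")


# The document from the question.
WITH_B = """<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>"""

# Hypothetical variant that lacks the 'b' and 'c' spans.
WITHOUT_B = """<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
</div>"""

for doc in (WITH_B, WITHOUT_B):
    root = ET.fromstring(doc)
    # iter("div") visits the root itself (if it is a div) and any nested divs.
    for div in root.iter("div"):
        span = pick_span(div)
        if span is not None:
            print(span.get("class"), span.text)
```

With a full XPath 1.0 engine, the single expression above can be passed to the engine directly instead.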

score 1 · Accepted Answer · edited May 23 '17 at 11:43


The XPath expression should be something like:

//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']

The expression to the left of the union operator | selects all b-class spans inside all divs. The expression on the right-hand side first finds all divs that do not have a b-class span and then selects their a-class span. The | operator combines the two result sets.

See here for selecting nodes with not() and here for combining results with the | operator.

Also, for the second part of your question, have a look here. Using node() in your XPath, you can select every child node (elements as well as text) of the selected node. So you can get everything in the div returned by

//div/node()

for future processing by other means.
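As a rough illustration of that second approach, the sketch below rebuilds the markup inside the div in Python. It assumes the standard-library ElementTree module, which does not support node() in its path syntax, so the child nodes are walked by hand. The helper name `inner_markup` is hypothetical.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Serialize the children of an element (text and tags) without its own tag."""
    # Text that appears before the first child element.
    parts = [element.text or ""]
    for child in element:
        # tostring() emits the child element followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


div = ET.fromstring("""<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>""")

print(inner_markup(div))
```

The resulting string keeps the tags, so a regex or a second parsing pass can then pick out the span of interest.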

edited May 23 '17 at 11:43 · Community
answered Jul 11 '12 at 19:47 · inVader

thanks this is very instructive. I noticed that `not(span[@class='b'])` seems to work like `not(./span[@class='b'])` (the former being the syntax from the `not()` link you provided). Is there a difference between the two? – jela Jul 11 '12 at 20:30

It's been quite a while since I last used XPath, so I copied that from the link to be on the safe side. I think the two should be equivalent in requiring that span[@class='b'] be a direct child of the div. If you use .// on the other hand, you will get any span[@class='b'] below your div in the DOM, even if it's the child of a child. But if you want to know for sure, have a more detailed look at the XPath manual in the w3schools link. – inVader Jul 11 '12 at 20:35
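The difference discussed in these comments can be checked with a small experiment. The sketch below assumes Python's standard-library ElementTree, whose path subset accepts all three forms. The nested document is invented for the test and is not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical div with one direct-child span and one span nested inside a <p>.
div = ET.fromstring(
    '<div class="info">'
    '<span class="b">direct child</span>'
    '<p><span class="b">nested grandchild</span></p>'
    '</div>'
)

direct = div.findall("span[@class='b']")         # child axis, implicit context node
dot_direct = div.findall("./span[@class='b']")   # child axis, explicit context node
descendants = div.findall(".//span[@class='b']") # any depth below the div

print(len(direct), len(dot_direct), len(descendants))
```

The first two calls return the same single element, while the `.//` form also picks up the nested span.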

score 0 · Answer 3 · answered Jul 11 '12 at 21:15


An expression that works on your input without the union operator:

//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]

This is just for fun. I'd probably use something more like @inVader's answer in production code.
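To see why the positional trick works, here is a sketch of the same two-step filtering in Python. It assumes the standard-library ElementTree module and reproduces the predicates by hand instead of evaluating the expression. The name `pick_by_position` is made up for the illustration.

```python
import xml.etree.ElementTree as ET


def pick_by_position(div):
    """Mimic span[@class='a' or @class='b'][count(../span[@class='b']) + 1]."""
    # First predicate: keep spans of class 'a' or 'b', in document order.
    candidates = [s for s in div.findall("span") if s.get("class") in ("a", "b")]
    # Second predicate: the 1-based XPath position count + 1
    # corresponds to the 0-based list index count.
    b_count = len(div.findall("span[@class='b']"))
    return candidates[b_count] if b_count < len(candidates) else None


with_b = ET.fromstring(
    '<div><span class="a">123</span><span class="b">456</span>'
    '<span class="c">789</span></div>'
)
without_b = ET.fromstring('<div><span class="a">123</span></div>')

print(pick_by_position(with_b).text)
print(pick_by_position(without_b).text)
```

With no b-class span the position is 1, which is the a-class span. With one b-class span after the a-class span the position is 2, which is the b-class span. The trick therefore relies on the element order shown in the question.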

answered Jul 11 '12 at 21:15 · Wayne
