3

Given the following div element:

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>

I want to retrieve the contents of the span with class "b". However, some of the divs I want to parse lack the last two spans (of class "b" and "c"). For these divs, I want the contents of the span with class "a". Is it possible to create a single XPath expression that selects this?

If it is not possible, is it possible to create a selector that retrieves the entire contents of the div, i.e. one that retrieves

<a href="/s/xyz.html" class="title">title</a>
<span class="a">123</span>
<span class="b">456</span>
<span class="c">789</span>

If I can do that, I can use a regex to find the data I want. (I can select the text within the div, but I'm not sure how to select the tags also. Just the text yields 123456789.)

Dimitre Novatchev
jela

3 Answers

2

More efficient -- requires no union:

   //div/span
          [@class='b'
           or
             @class='a'
            and
             not(parent::*[span[@class='b']])
           ]
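
To make the operator precedence explicit (in XPath, `and` binds more tightly than `or`), here is a minimal Python sketch of the same predicate. It is only an emulation: Python's standard `xml.etree.ElementTree` supports a limited XPath subset without `or` and `not()`, so the test is written in plain Python. The names `matches` and `select_single_pass` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

WITH_B = """<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>"""

WITHOUT_B = """<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
</div>"""


def matches(span, parent):
    """Emulate the predicate applied to one span child of `parent`."""
    cls = span.get("class")
    parent_has_b = parent.find("span[@class='b']") is not None
    # Parenthesized the way XPath parses it: b or (a and not(parent has b)).
    return cls == "b" or (cls == "a" and not parent_has_b)


def select_single_pass(root):
    """Visit every div once and keep the span children that match."""
    return [
        span
        for div in root.iter("div")  # iter() also yields root if it is a div
        for span in div.findall("span")
        if matches(span, div)
    ]


for doc in (WITH_B, WITHOUT_B):
    print([s.text for s in select_single_pass(ET.fromstring(doc))])
```

With a full XPath 1.0 engine you would simply evaluate the expression itself; the sketch is only meant to show how the predicate is grouped.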

An expression (like the one below) that is the union of two absolute "//" expressions typically performs two complete document tree traversals, after which the union operation deduplicates the results and sorts them in document order. All of this can be significantly less efficient than a single tree traversal, unless the XPath processor has an intelligent optimizer.

An example of such an inefficient expression:

//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "//div/span
          [@class='b'
           or
             @class='a'
            and
             not(parent::*[span[@class='b']])
           ]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the provided XML document:

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>

The XPath expression is evaluated and the selected elements (in this case just one) are copied to the output:

<span class="b">456</span>

When the same transformation is applied on a different XML document, where there is no class='b':

<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="x">456</span>
    <span class="c">789</span>
</div>

the same XPath expression is evaluated and the correctly selected element is copied to the output:

<span class="a">123</span>
Dimitre Novatchev
1

The XPath expression should be something like:

//div/span[@class='b'] | //div[not(./span[@class='b'])]/span[@class='a']

The expression to the left of the union operator | selects all b-class spans inside all divs. The expression on the right-hand side first finds all divs that do not have a b-class span and then selects their a-class spans. The | operator combines the two result sets.
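
As a rough illustration of what the union does, here is a Python sketch that emulates both operands and then merges them. It assumes only the standard library; `xml.etree.ElementTree` does not support `|` or `not()`, so the two operands are written as list comprehensions. The function name `select_union` and the sample document are made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
    <div><span class="a">1</span></div>
    <div><span class="a">2</span><span class="b">3</span></div>
</root>"""


def select_union(root):
    # Position of every element in document order, used to sort the result.
    order = {id(el): i for i, el in enumerate(root.iter())}

    # Left operand: //div/span[@class='b']
    left = [s for div in root.iter("div")
            for s in div.findall("span[@class='b']")]

    # Right operand: //div[not(./span[@class='b'])]/span[@class='a']
    right = [s for div in root.iter("div")
             if div.find("span[@class='b']") is None
             for s in div.findall("span[@class='a']")]

    # Union: remove duplicates, then return the nodes in document order.
    merged = {id(el): el for el in left + right}
    return sorted(merged.values(), key=lambda el: order[id(el)])


print([s.text for s in select_union(ET.fromstring(DOC))])
```

The first div has no b-class span, so its a-class span is kept; the second div contributes only its b-class span.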

See here for selecting nodes with not() and here for combining results with the | operator.

Also, regarding the second part of your question, have a look here. Using node() in your XPath, you can select all child nodes (elements and text) of the selected node. So you can get everything directly inside the div returned by

//div/node()

for future processing by other means.
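
If you end up doing this from a host language instead, the idea can be sketched in Python with the standard library. This is an emulation, not an evaluation of `//div/node()`: `xml.etree.ElementTree` has no separate text-node objects, so the sketch joins the div's leading text with each serialized child. The helper name `inner_markup` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<div class="info">
    <a href="/s/xyz.html" class="title">title</a>
    <span class="a">123</span>
    <span class="b">456</span>
    <span class="c">789</span>
</div>"""


def inner_markup(div):
    """Return the markup between the div's start and end tags."""
    parts = [div.text or ""]  # text before the first child element
    for child in div:
        # tostring() also emits the child's tail (the text that follows it).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


div = ET.fromstring(DOC)
print(inner_markup(div))
```

Once the elements are available as objects like this, reading `span.get("class")` and `span.text` directly is likely simpler than running a regex over the serialized markup.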

inVader
  • thanks this is very instructive. I noticed that `not(span[@class='b'])` seems to work like `not(./span[@class='b'])` (the former being the syntax from the `not()` link you provided). Is there a difference between the two? – jela Jul 11 '12 at 20:30
  • It's been quite a while since I last used XPath, so I copied that from the link to be on the safe side. I think the two should be equivalent in requiring that span[@class='b'] be a direct child of the div. If you use .// on the other hand, you will get any span[@class='b'] below your div in the DOM, even if it's the child of a child. But if you want to know for sure, have a more detailed look at the XPath manual in the w3schools link. – inVader Jul 11 '12 at 20:35
0

An expression that works on your input without the union operator:

//div/span[@class='a' or @class='b'][count(../span[@class='b']) + 1]

This is just for fun. I'd probably use something more like @inVader's answer in production code.
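
To see why the positional trick works, and where it stops working, here is a small Python sketch that emulates the two predicates for a single div. It uses only the standard library, and the name `select_by_position` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def select_by_position(div):
    # First predicate: keep the span children whose class is 'a' or 'b'.
    candidates = [s for s in div.findall("span")
                  if s.get("class") in ("a", "b")]
    # Second predicate: 1-based position count(../span[@class='b']) + 1.
    position = len(div.findall("span[@class='b']")) + 1
    if position <= len(candidates):
        return candidates[position - 1]
    return None  # no candidate at that position


with_b = ET.fromstring(
    '<div><span class="a">123</span><span class="b">456</span>'
    '<span class="c">789</span></div>')
without_b = ET.fromstring(
    '<div><span class="a">123</span><span class="x">456</span></div>')
only_b = ET.fromstring('<div><span class="b">456</span></div>')

for div in (with_b, without_b, only_b):
    hit = select_by_position(div)
    print(None if hit is None else hit.text)
```

The last case shows the limitation: the expression relies on an a-class span preceding the b-class span, so a div that contains only a b-class span yields nothing.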

Wayne