XPath descendant and descendant-or-self work completely differently

Question

I am trying to find every second td among the descendants of the div with the specified id, i.e. 22 and 222. The first solution that came to mind was:

//div[@id='indicator']//td[2]

but it selects only the first matching table cell, i.e. 22, rather than both 22 and 222. Then I replaced // with /descendant-or-self::node()/ and got the same result (obviously). But when I removed '-or-self', the XPath expression started to work as expected:

 test1 = test_tree.xpath(u"//div[@id='indicator']/descendant-or-self::node()/td[2]")
 print(len(test1))  # prints 1 (only the first match: 22)

 test1 = test_tree.xpath(u"//div[@id='indicator']/descendant::node()/td[2]")
 print(len(test1))  # prints 2 (22 and 222)

Here is the test HTML:

<html>
    <body>
        <div id='indicator'>
            <table>
                <tbody>
                    <tr>
                        <th>1</th>
                        <th>2</th>
                        <th>3</th>
                    </tr>
                    <tr>
                        <td>11</td>
                        <td>22</td>
                        <td>33</td>
                    </tr>
                    <tr>
                        <td>111</td>
                        <td>222</td>
                        <td>333</td>
                    </tr>
                </tbody>
            </table>
        </div>
    </body>
</html>

I'm wondering why the two expressions don't behave identically, since all the tds are descendants of the div element whether or not the div itself is included.
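For readers who want to reproduce the abbreviated expression without lxml, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. This is an assumption of convenience, not the setup from the question: ElementTree supports only a subset of XPath (the // step and positional predicates, but not explicit axes such as descendant::node()), and the names HTML, second_cells and second_texts are illustrative. It shows what one processor selects; other processors may differ, which is exactly the subject of this question.

```python
import xml.etree.ElementTree as ET

# The test HTML from the question; it is well-formed, so an XML parser accepts it.
HTML = """
<html>
    <body>
        <div id='indicator'>
            <table>
                <tbody>
                    <tr><th>1</th><th>2</th><th>3</th></tr>
                    <tr><td>11</td><td>22</td><td>33</td></tr>
                    <tr><td>111</td><td>222</td><td>333</td></tr>
                </tbody>
            </table>
        </div>
    </body>
</html>
"""

root = ET.fromstring(HTML)

# ElementTree paths must be relative, hence the leading "." before "//".
second_cells = root.findall(".//div[@id='indicator']//td[2]")
second_texts = [td.text for td in second_cells]
print(second_texts)  # ['22', '222']
```

Here the positional predicate is applied per parent row, so one cell is returned for each tr that has at least two td children.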

All three XPath expressions give 2 elements in the output on online XPath testers – splash58, Jul 29 '15 at 08:55
lol. Here is my output: exactly the same code but different results http://imgur.com/fZCL6nH – Anton Kolokolcev, Jul 29 '15 at 09:46
One more note: Selenium IDE also highlights only the first td[2], while the FireFinder extension for Firebug shows both :( – Anton Kolokolcev, Jul 29 '15 at 09:49
I replicated your example on a local server with an HTML page that contains your HTML example. I am using Scrapy, so my selector is the lxml XPath selector. I used the XPath `.//div[@id='indicator']//tr/td[2]` and it gives me the correct results `[u'22', u'222']` – William Kinaan, Jul 29 '15 at 14:17
@WilliamKinaan Yes. Adding a tr parent also works in my case, but I'm just wondering why it doesn't work simply as //td[2] – Anton Kolokolcev, Jul 29 '15 at 14:45

score 1 · Answer 1 · answered Jul 29 '15 at 12:23

I think you have found a bug in your XPath processor.

Michael Kay

I've posted a bug report to the lxml bug tracker. It is probably related to Python + lxml + Windows interoperability – Anton Kolokolcev Jul 29 '15 at 14:51

score 0 · Answer 2 · edited May 23 '17 at 10:24

I built a web page containing the HTML you provided in your question.

When you use this XPath:

.//div[@id='indicator']//tr/td[2]

It works as expected and the result is:

[u'<td>22</td>', u'<td>222</td>']

However, according to your comment, you were asking why .//td[2] doesn't work. The reason is .//td gives you a list of all the td(s) in your DOM. Adding an index such as [2] will result in the second td in that list.

To sum up: These are the results of applying .//td and .//td[2] respectively:

And if you want the text inside these tds, you should add /text() as follows:

Update:

The OP said:

So why then //div[@id='indicator']/descendant::node()/td[2] produces ['22', '222']? According to your comment: "Adding an index such as [2] will result in the second td in that list" it should populate only ['22'].

I will try to explain what is going on here:

descendant::node() is not equivalent to //
The equivalent of // is: /descendant-or-self::node()/
This is explained in the W3C XPath specification (http://www.w3.org/TR/xpath/#path-abbrev).
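The distinction the specification draws between //para[1] and /descendant::para[1] can be illustrated for td[2] with a small sketch. It uses the standard-library xml.etree.ElementTree and emulates the two meanings by hand; the helper names second_td_per_parent and second_td_overall are hypothetical, and the table is a reduced copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Reduced version of the question's table (header row omitted for brevity).
div = ET.fromstring(
    "<div id='indicator'><table><tbody>"
    "<tr><td>11</td><td>22</td><td>33</td></tr>"
    "<tr><td>111</td><td>222</td><td>333</td></tr>"
    "</tbody></table></div>"
)

def second_td_per_parent(context):
    """Emulate //td[2]: position is counted among each parent's td children."""
    hits = []
    for node in context.iter():           # context node plus its element descendants
        td_children = node.findall("td")  # child axis only
        if len(td_children) >= 2:
            hits.append(td_children[1])
    return hits

def second_td_overall(context):
    """Emulate /descendant::td[2]: position is counted over all td descendants."""
    all_tds = list(context.iter("td"))    # document order
    return all_tds[1:2]

per_parent = [td.text for td in second_td_per_parent(div)]
overall = [td.text for td in second_td_overall(div)]
print(per_parent)  # ['22', '222']
print(overall)     # ['22']
```

With this data the second reading yields only 22, which happens to match the single-result output reported in the question.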

I hope this code could help you:

answered Jul 29 '15 at 18:50

William Kinaan

So why then `//div[@id='indicator']/descendant::node()/td[2]` produces ['22', '222']? According to your comment: "Adding an index such as [2] will result in the second td in that list" it should populate only ['22']. – Anton Kolokolcev Jul 30 '15 at 09:21
OK. I'll try to explain again: `test = test_tree.xpath(u"//div[@id='indicators_minimize']/descendant-or-self::node()")[0]` then `print etree.tostring(test, encoding='cp866', pretty_print=True)` Result (markup lost; cell values only): `1 2 3 / 11 22 33 / 111 222 333` – Anton Kolokolcev Jul 30 '15 at 13:05
Then: `test = test_tree.xpath(u"//div[@id='indicators_minimize']/descendant::node()")[0]` then `print etree.tostring(test, encoding='cp866', pretty_print=True)` Result (markup lost; cell values only): `1 2 3 / 11 22 33 / 111 222 333` – Anton Kolokolcev Jul 30 '15 at 13:10
The only difference is that the first XML output is encapsulated in a div tag (because of the -or-self part). But we don't care about the div parent, or any other parent depth level, because we're looking for descendants. After the expression is evaluated, it doesn't matter where the tds are found: in the div's descendants or in the table's descendants (both contain the needed tds). However, the results differ, on my machine at least. Have a look at the first two answers here: all three XPath expressions return the same values. – Anton Kolokolcev Jul 30 '15 at 13:23
@AntonKolokolcev Again, what is your question? You have already asked two different questions and I answered both of them. Kindly be specific; I am trying to help here. – William Kinaan Jul 30 '15 at 13:30
My question is: why do the expressions `//div[@id='indicator']/descendant-or-self::node()/td[2]` and `//div[@id='indicator']/descendant::node()/td[2]` produce different results? Just for reference: the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on. The descendant-or-self axis contains the context node and the descendants of the context node. – Anton Kolokolcev Jul 30 '15 at 14:08
They produce different results because they are **different expressions**. Their meaning (as I mentioned before) is described [here](http://www.w3.org/TR/xpath/#path-abbrev). *Again*: NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents. **Did you get that note?** – William Kinaan Jul 30 '15 at 14:18
Yes, I got this note. 3 more comments: 1) First you said "The reason is .//td gives you a list of all the td(s) in your DOM. Adding an index such as [2] will result in the second td in that list" and in your last comment: "the former (//para[1]) selects all descendant para elements that are the first para children of their parents." which is totally opposite to your first comment. 2) Refer to the image from the second answer: http://i.imgur.com/32WRNHs.png and then to my image: http://m.imgur.com/fZCL6nH. Completely identical code produces different results on different machines – Anton Kolokolcev Jul 30 '15 at 15:00
3) //para[1] does not mean the same as /descendant::para[1] because //para[1] means /descendant-or-self::node()/para[1] – Anton Kolokolcev Jul 30 '15 at 15:02
1) Both descriptions I gave are the same; they are not opposite (try to think more about it). 2) The second image is correct. The first one is not. However, you would need to check which XPath version the parser is using. Even though it could be a bug, the important thing is that you have the information to understand the **two** XPath expressions correctly. 3) As for your third point, I already covered that in my answer. Best of luck with your project. – William Kinaan Jul 30 '15 at 15:26

score 0 · Answer 3 · edited Jul 30 '15 at 09:57

I think I've found the cause of this issue:

http://www.w3.org/TR/xpath20/#id-errors-and-opt

"In some cases, a processor can determine the result of an expression without accessing all the data that would be implied by the formal expression semantics. For example, the formal description of filter expressions suggests that $s[1] should be evaluated by examining all the items in sequence $s, and selecting all those that satisfy the predicate position()=1. In practice, many implementations will recognize that they can evaluate this expression by taking the first item in the sequence and then exiting."

So there is no remedy: the behaviour depends on the XPath processor implementation. However, I still don't understand why //div[@id='indicator']/descendant-or-self::node()/td[2] and //div[@id='indicator']/descendant::node()/td[2] produce different results.
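One way to check what the step-by-step semantics imply for these two expressions is to emulate them by hand. The sketch below is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree rather than lxml, the helper name child_td_2 is hypothetical, and the table is a reduced copy of the test HTML. ElementTree does not model text nodes, but text nodes have no children, so they cannot contribute any td to the final step.

```python
import xml.etree.ElementTree as ET

# Reduced version of the question's table (header row omitted for brevity).
div = ET.fromstring(
    "<div id='indicator'><table><tbody>"
    "<tr><td>11</td><td>22</td><td>33</td></tr>"
    "<tr><td>111</td><td>222</td><td>333</td></tr>"
    "</tbody></table></div>"
)

def child_td_2(context_nodes):
    """Apply the step child::td[2] to every context node and merge the results."""
    hits = []
    for node in context_nodes:
        tds = node.findall("td")  # td children of this one context node
        if len(tds) >= 2:
            hits.append(tds[1].text)
    return hits

descendant_or_self = list(div.iter())  # the div itself, then its element descendants
descendant = descendant_or_self[1:]    # the same nodes without the div

with_self = child_td_2(descendant_or_self)
without_self = child_td_2(descendant)
print(with_self)     # ['22', '222']
print(without_self)  # ['22', '222']
```

Because the div has no td children of its own, including or excluding it as a context node does not change the outcome of the final step in this emulation.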
