I am trying to write an XPath expression which can return the URL associated with the next page of a search.

The URL that leads to the next page of the search is always the href of the a tag following the tag span class="navCurrentPage". I have been trying to use a following-sibling step to pull the next URL. My query in the Chrome console is:

$x('//span[@class="navCurrentPage"][1]/following-sibling::a/@href[1]')

I thought that by specifying @href[1] I would get back only one URL (assuming the [1] chooses the first element in the list), but instead Chrome (and Scrapy) return four URLs. I don't understand why. Please help me understand how to select the one URL that I am looking for.

Here is the URL where you can find the HTML giving me trouble:

https://www.yachtworld.com/core/listing/cache/searchResults.jsp?cit=true&slim=quick&ybw=&sm=3&searchtype=advancedsearch&Ntk=boatsEN&Ntt=&is=false&man=&hmid=102&ftid=101&enid=0&type=%28Sail%29&fromLength=35&toLength=50&fromYear=1985&toYear=2010&fromPrice=&toPrice=&luom=126&currencyid=100&city=&rid=100&rid=101&rid=104&rid=105&rid=107&rid=108&rid=112&rid=114&rid=115&rid=116&rid=128&rid=130&rid=153&pbsint=&boatsAddedSelected=-1

Thank you for the help.

kenlukas
smeesn

4 Answers

1

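Operator precedence: //x[1] means /descendant-or-self::node()/child::x[1], which finds every descendant x that is the first x child of its parent. You want (//x)[1], which finds the first node among all the descendants named x. The same applies to your expression: @href[1] selects the first href attribute of each following a element, so every matching a contributes one URL. Wrapping the whole path, as in (//span[@class="navCurrentPage"]/following-sibling::a/@href)[1], returns a single URL.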
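
A minimal sketch of the difference, using Python's standard-library xml.etree.ElementTree on a made-up document (the element names and href values are hypothetical). ElementTree implements only a subset of XPath and, to my knowledge, does not accept the parenthesised form (//a)[1], so the "first overall" case is taken here with find(), which returns the first match in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two parents, each with two <a> children.
doc = ET.fromstring(
    "<root>"
    "<div><a href='p1'/><a href='p2'/></div>"
    "<div><a href='p3'/><a href='p4'/></div>"
    "</root>"
)

# .//a[1] applies [1] per parent: the first <a> under each <div>.
per_parent = [a.get("href") for a in doc.findall(".//a[1]")]

# find() stops at the first match in document order, which is what
# (//a)[1] would select in a full XPath engine.
first_overall = doc.find(".//a").get("href")

print(per_parent)     # ['p1', 'p3']
print(first_overall)  # p1
```

The predicate binds to the step it is attached to, so one result comes back per parent unless the whole path is evaluated first and then indexed.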

Michael Kay
  • Thanks. This syntax works within Scrapy and the Chrome console. If you can clarify how the brackets work within XPath expressions, it would be a big help. Chrome seems to use them extensively when identifying elements (for instance, Chrome provides the following XPath for an individual element: "//*[@id="searchResultsHeader"]/div[1]/span[3]/a[1]"). I'm not sure why these brackets are valid and mine are not. – smeesn Sep 03 '19 at 13:52
  • That example doesn't use a numeric predicate in a step starting with "//", which is where the complications in your example arise. See the note under bullet item 3 in the spec §3.3.5: https://www.w3.org/TR/xpath-31/#abbrev – Michael Kay Sep 03 '19 at 14:41
0

An XPath index predicate is applied to every matching node, so the expression can still return several results. If you want only the first item, take the first result. In Scrapy:

response.xpath('//span[@class="navCurrentPage"][1]/following-sibling::a/@href[1]').extract_first()
Ed Bangga
0

Just add .extract_first() or .get() to fetch the first item.

See the Scrapy documentation here.

Vicky T
0

I've found this very helpful for making sure the bracket is in the right place: "What is the XPath expression to find only the first occurrence?" Also note that XPath positions are 1-based, so the first occurrence is [1], not [0].
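
A small sketch of how XPath's position predicate lines up with Python's list indexing, again with the standard-library xml.etree.ElementTree (a subset of XPath) and a hypothetical fragment whose href values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical pagination fragment.
nav = ET.fromstring(
    "<nav><a href='page2'/><a href='page3'/><a href='page4'/></nav>"
)

# XPath position predicates count from 1 ...
first_by_xpath = nav.find("a[1]").get("href")
second_by_xpath = nav.find("a[2]").get("href")

# ... while the Python list returned by findall() counts from 0.
first_by_list = nav.findall("a")[0].get("href")

print(first_by_xpath, first_by_list, second_by_xpath)  # page2 page2 page3
```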

reg202