I would like to extract URLs of the following type using XPath in Scrapy: the link text is a number with any number of digits, and the href is arbitrary text.

  • <a href="http://www.example.com/link_to_some_page.html">3</a>
  • <a href="http://www.example.com/another_link-abcd.html">45</a>

I came up with something like this:

HtmlXPathSelector(response).select('//a[matches(text(),"^\d+$")]/@href')

However, it appears that XPath 2.0 isn't supported, so I can't use regular expressions.

The best single-line solution I could find was from this question: "xpath expression for regex-like matching?". Is there a better way to achieve this in Scrapy?

user

1 Answer

.select('//a[. != "" and translate(., "0123456789", "") = ""]/@href')
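For anyone wondering why this works: translate(., "0123456789", "") deletes every digit from the link text, so the result is empty only when the text contained nothing but digits, and the `. != ""` guard rejects links whose text was empty to begin with. Below is a minimal sketch that runs the same expression outside Scrapy. It assumes lxml is installed and uses it directly rather than a Scrapy selector; the sample markup and the helper name numeric_link_hrefs are made up for illustration.

```python
from lxml import html  # assumption: lxml is available (Scrapy depends on it)

# Made-up sample markup: two numeric links plus links that must not match.
SAMPLE = """
<html><body>
  <a href="http://www.example.com/link_to_some_page.html">3</a>
  <a href="http://www.example.com/another_link-abcd.html">45</a>
  <a href="http://www.example.com/next.html">Next</a>
  <a href="http://www.example.com/mixed.html">4a</a>
  <a href="http://www.example.com/empty.html"></a>
</body></html>
"""

# translate() strips all digits; an empty remainder means "digits only".
# The `. != ""` test excludes anchors with no text at all.
DIGITS_ONLY = '//a[. != "" and translate(., "0123456789", "") = ""]/@href'


def numeric_link_hrefs(markup):
    """Return the href of every <a> whose text is one or more digits."""
    tree = html.fromstring(markup)
    return [str(href) for href in tree.xpath(DIGITS_ONLY)]


if __name__ == "__main__":
    for href in numeric_link_hrefs(SAMPLE):
        print(href)
```

In a spider, the same expression string would be passed to the selector's select() call as in the line above.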
Tomalak
  • +1, smart answer. @Tomalak, how about matching a string exactly while ignoring case in XPath 1.0? E.g. if my string is "Next", I would use something like (r"^next$", re.I). How would I do that without regex? – user Jun 20 '11 at 06:34
  • @buffer: That's been asked before, several times. Just search for it. ;) - BTW, another variant would be `... and string(number(.)) != 'NaN'`, but that would accept numerical notations beyond "digits only". – Tomalak Jun 20 '11 at 06:49
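Both points from the comments can be tried out in the same way. The usual XPath 1.0 idiom for a case-insensitive exact match is to fold the text to lower case with translate() and compare it against a lower-case literal; note that this only folds the ASCII letters listed. The sketch below also runs the number() variant to show that it accepts more than plain digits. As before, it assumes lxml is installed, and the sample markup and the helper name hrefs are made up for illustration.

```python
from lxml import html  # assumption: lxml is available (Scrapy depends on it)

# Made-up sample markup for illustration only.
SAMPLE = """
<html><body>
  <a href="/page/2">NEXT</a>
  <a href="/page/3">next</a>
  <a href="/about">Next steps</a>
  <a href="/n/7">7</a>
  <a href="/n/neg">-3</a>
  <a href="/n/dec">4.5</a>
</body></html>
"""

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Case-insensitive exact match: fold ASCII letters to lower case,
# then compare against the lower-case target string.
NEXT_LINK = '//a[translate(., "%s", "%s") = "next"]/@href' % (UPPER, LOWER)

# number() variant from the comment: anything XPath can parse as a
# number passes, so signed and decimal values match as well.
NUMERIC = "//a[string(number(.)) != 'NaN']/@href"

# Digits-only expression from the answer, for comparison.
DIGITS_ONLY = '//a[. != "" and translate(., "0123456789", "") = ""]/@href'


def hrefs(markup, expression):
    """Evaluate an XPath expression and return the results as strings."""
    tree = html.fromstring(markup)
    return [str(h) for h in tree.xpath(expression)]


if __name__ == "__main__":
    print(hrefs(SAMPLE, NEXT_LINK))
    print(hrefs(SAMPLE, NUMERIC))
    print(hrefs(SAMPLE, DIGITS_ONLY))
```

The translate-based digits check is the stricter of the two numeric tests, which matches the "digits only" requirement in the question.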