-1

I'm trying to extract all hrefs on a page.

I have tried the following:
response.css('a::attr(href)').extract()
response.xpath('//@href').extract()

It's extracting a significant chunk of the links, but not all of them...

Specifically, I'm unable to scrape the Twitter link from this site: https://www.acchain.org/

Any insight is appreciated.

BoltClock
vcovo
  • Possible duplicate of [Scrapy: Extract links and text](https://stackoverflow.com/questions/27753232/scrapy-extract-links-and-text) – parik Feb 16 '18 at 11:42

3 Answers

4

The website uses JavaScript to generate some of its content, including the sidebar (generated by https://www.acchain.org/js/sidebar.js), so those links are not in the HTML that Scrapy downloads.

The simplest way to scrape these links is to execute the JavaScript, e.g. in a browser.
There are several ways to do this, but probably the simplest is the scrapy-splash middleware.
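
As a rough sketch of what enabling scrapy-splash involves, the project's `settings.py` might contain something like the following. The middleware names and priorities follow the scrapy-splash README; the `SPLASH_URL` value is an assumption (a Splash instance running locally on its default port 8050) and would need to match your own setup.

```python
# settings.py fragment: a sketch for enabling scrapy-splash.
# Assumption: a Splash instance is reachable at this address.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares and priorities as listed in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes Splash request arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With settings along these lines, the spider would yield `SplashRequest(url, callback, args={'wait': 1})` from `scrapy_splash` in place of a plain `scrapy.Request` (the `wait` value is a guess, not something measured for this site), and the same `response.css('a::attr(href)')` selector would then run against the rendered page.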

stranac
  • Ah I see. I hadn't faced this problem yet. How did you identify that they're being generated? Thank you! – vcovo Feb 16 '18 at 16:47
  • I looked at the HTML source, saw that the links weren't there and that the element meant to hold them was empty, and noticed the JavaScript file. – stranac Feb 16 '18 at 17:46
  • What do you use to inspect? Because when I inspect the page and search for "twitter" it shows up immediately (thus the confusion). – vcovo Feb 16 '18 at 20:00
  • I looked at the actual source (Ctrl+U in most browsers) – stranac Feb 16 '18 at 20:21
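
The check described in these comments can be approximated in code: collect the `href` values from the raw HTML (what Ctrl+U shows and what Scrapy receives) and see whether the link is there at all. The sketch below uses only Python's standard library; `RAW_SOURCE` is a hypothetical stand-in for the page source, not the real markup of the site, and the `sidebar` id is made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect every href attribute value found in start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def hrefs_in_source(html):
    """Return all href values present in the given raw HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical raw source: the sidebar container is empty because a script
# fills it in later, so its links never appear in the downloaded HTML.
RAW_SOURCE = """
<html>
  <body>
    <a href="/about">About</a>
    <div id="sidebar"></div>
    <script src="/js/sidebar.js"></script>
  </body>
</html>
"""

found = hrefs_in_source(RAW_SOURCE)
print(found)
print(any("twitter" in href for href in found))
```

If a link shows up in the browser's element inspector but not in a check like this, it was most likely added by JavaScript after the page loaded.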
0

You can refer to the Scrapy tutorial when writing code for this page, since the page uses JavaScript to generate the content of its body.

-1

It should be `//a/@href`. Tested in bash on Linux with:

xmllint --html --recover --xpath '//a/@href' test.html | sed -e 's/href/\nhref/g'

LMC
  • If `//@href` doesn't select a link then `//a/@href` won't select it either. – Michael Kay Feb 16 '18 at 09:27
  • @michael-kay as my answer says, it was tested, so I'm positive that `//a/@href` works. – LMC Feb 16 '18 at 17:58
  • The answer from @stranac says it doesn't work because some of the links are not in @href attributes. Basically, the OP was confused about whether they wanted all the "links" or all the `href` attributes, which are two different requirements. – Michael Kay Feb 16 '18 at 18:47
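
The point made in these comments, that `//a/@href` can only ever select a subset of what `//@href` selects, can be illustrated with a small sketch. It uses Python's standard-library `xml.etree.ElementTree` on a made-up, well-formed snippet (an assumption for illustration, not the real page), and approximates the two XPath expressions with plain iteration rather than a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet with href attributes on several elements.
SNIPPET = """
<html>
  <head><link rel="stylesheet" href="/css/site.css"/></head>
  <body>
    <a href="/about">About</a>
    <a name="top">No href here</a>
    <area href="/map"/>
  </body>
</html>
""".strip()

root = ET.fromstring(SNIPPET)

# Roughly //@href: the href attribute of any element that has one.
all_hrefs = [el.get("href") for el in root.iter() if el.get("href") is not None]

# Roughly //a/@href: the href attribute of <a> elements only.
anchor_hrefs = [a.get("href") for a in root.iter("a") if a.get("href") is not None]

print(all_hrefs)
print(anchor_hrefs)
print(set(anchor_hrefs) <= set(all_hrefs))
```

Neither expression can find a link that is missing from the downloaded HTML, which is why narrowing the XPath does not help with JavaScript-generated links.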