In your Spider, define allowed_domains as a list of the domains you want to crawl.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
Then you can use response.follow() to follow links; requests to other domains are filtered out by Scrapy's offsite middleware. See the docs for Spiders and the tutorial.
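For example, a minimal sketch (the li.next selector is an assumption based on the pagination markup of quotes.toscrape.com, as used in the Scrapy tutorial):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow each pagination link; response.follow() resolves
        # relative URLs and accepts selectors directly. Off-domain
        # requests are dropped thanks to allowed_domains.
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, callback=self.parse)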
Alternatively, you can filter the links by domain with a LinkExtractor (as David Thompson mentioned).
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Keep only links that stay on quotes.toscrape.com
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
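As a small design note, a LinkExtractor is stateless and reusable, so you can build it once at class level instead of on every parse() call. A sketch:

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']
    # Built once; extract_links() can then be called on every response
    # without reconstructing the extractor.
    link_extractor = LinkExtractor(allow_domains=['quotes.toscrape.com'])

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse)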