In your Spider, define allowed_domains as a list of the domains you want to crawl.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
Then you can use response.follow() to follow links; requests to other domains are filtered out by Scrapy's offsite middleware. See the docs for Spiders and the tutorial.
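For example, a minimal sketch (the li.next selector is an assumption based on the pagination markup of quotes.toscrape.com, as used in the Scrapy tutorial):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # Follow each pagination link; response.follow() resolves
        # relative URLs and accepts selectors directly. Off-domain
        # requests are dropped thanks to allowed_domains.
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, callback=self.parse)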
Alternatively, you can filter the links by domain with a LinkExtractor (as David Thompson mentioned).
import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Keep only links that stay on quotes.toscrape.com
        for a in LinkExtractor(allow_domains=['quotes.toscrape.com']).extract_links(response):
            yield response.follow(a, callback=self.parse)
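As a small design note, a LinkExtractor is stateless and reusable, so you can build it once at class level instead of on every parse() call. A sketch:

import scrapy
from scrapy.linkextractors import LinkExtractor

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/page/1/']
    # Built once; extract_links() can then be called on every response
    # without reconstructing the extractor.
    link_extractor = LinkExtractor(allow_domains=['quotes.toscrape.com'])

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            yield response.follow(link, callback=self.parse)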