
I can't find any solution for using start_requests with rules, and I haven't seen any example on the Internet that combines these two. My goal is simple: I want to redefine the start_requests function so that I can catch all exceptions raised during requests and also use meta in the requests. This is the code of my spider:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.oreilly.com']
    start_urls = ['https://www.oreilly.com/library/view/practical-postgresql/9781449309770/ch04s05.html']

    # Based on the Scrapy docs
    def start_requests(self):
        for u in self.start_urls:
            yield Request(u, callback=self.parse_item, errback=self.errback_httpbin, dont_filter=True)

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERRRRROR - {}'.format(failure))

This code scrapes only one page. I tried to modify it, and instead of:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    yield item

I tried to use this, based on this answer:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    return self.parse(response) 

It seems to work, but it doesn't scrape anything, even if I add a parse function to my spider. Does anybody know how to use start_requests and rules together? I would be glad for any information about this topic. Happy coding!

NashGC
  • somebody, any ideas? – NashGC Jun 23 '19 at 17:10
  • I asked a similar question last week, but couldn't find a way either. This was the question https://stackoverflow.com/questions/56616527/scrapy-linkextractor-in-control-flow-and-why-it-doesnt-work – gunesevitan Jun 23 '19 at 19:03
  • @gunesevitan, have you seen this [answer](https://stackoverflow.com/questions/38280133/scrapy-rules-not-working-when-process-request-and-callback-parameter-are-set)? It gets my Rules working, but it doesn't crawl anything because the parse function is empty. If I redefine the parse function it still doesn't work. – NashGC Jun 24 '19 at 05:59

3 Answers


I found a solution. Frankly speaking, I don't know how it works, but it certainly does.

from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']
    login_page = 'https://books.toscrape.com'

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, errback=self.errback_httpbin, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response)

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERRRRROR - {}'.format(failure))
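
My best guess at why it works: FormRequest.from_response(response) is yielded without an explicit callback, so its response falls back to CrawlSpider's default parse handling, and that is exactly the code that applies the rules. If that is right, the same effect should be possible without the login step. Here is a minimal, untested sketch of that idea (the spider name is a placeholder; errback stays a per-request argument, and meta could be passed the same way):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RulesWithErrbackSpider(CrawlSpider):
    name = 'rules_with_errback'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            # No callback here: the response goes to CrawlSpider's built-in
            # parse handling, which evaluates the rules, while errback
            # (and meta) can still be set per request.
            yield scrapy.Request(url, errback=self.errback_httpbin, dont_filter=True)

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))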
NashGC

To catch errors from your rules you would need to define an errback for your Rule(), but unfortunately that is not possible at the moment.

You either need to extract and yield the requests yourself (that way you can attach an errback), or process each response using a middleware.
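
A minimal, untested sketch of the first option: a plain Spider that follows links through a LinkExtractor by hand, so every request can carry its own errback (and meta). The spider name, site and errback name are placeholders:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ManualFollowSpider(scrapy.Spider):
    name = 'manual_follow'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    link_extractor = LinkExtractor()

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_item,
                                 errback=self.errback_httpbin, dont_filter=True)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'url': response.url,
        }
        # Follow links manually instead of relying on Rule(),
        # so each request gets an errback (and could carry meta too).
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_item,
                                 errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))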

gangabass

Here is a solution for handling errback with a LinkExtractor.

Thanks to this dude!
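
Since the code is not shown here, this is only a rough, untested sketch of one way to do it: the Rule's process_request hook can attach an errback to every request built from the extracted links. The spider name, site and attach_errback helper are placeholders, not part of the linked solution:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ErrbackRuleSpider(CrawlSpider):
    name = 'errback_rule'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # process_request lets us modify every request built from a rule,
        # which is one place to hang an errback (or meta) on it.
        Rule(LinkExtractor(), callback='parse_item', follow=True,
             process_request='attach_errback'),
    )

    def attach_errback(self, request, response=None):
        # Newer Scrapy versions also pass the response to process_request,
        # hence the optional second argument.
        return request.replace(errback=self.errback_httpbin)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'url': response.url,
        }

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))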

NashGC