
I can't find any solution for using start_requests with rules, and I haven't seen any example on the Internet that combines these two. My goal is simple: I want to redefine the start_requests function so that I can catch all exceptions raised during requests and also use meta in the requests. This is the code of my spider:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['www.oreilly.com']
    start_urls = ['https://www.oreilly.com/library/view/practical-postgresql/9781449309770/ch04s05.html']

    # Based on the Scrapy docs
    def start_requests(self):
        for u in self.start_urls:
            yield Request(u, callback=self.parse_item, errback=self.errback_httpbin, dont_filter=True)

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERRRRROR - {}'.format(failure))

This code scrapes only one page. I tried to modify it, and instead of:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    yield item

I tried to use this, based on this answer:

def parse_item(self, response):
    item = {}
    item['title'] = response.xpath('//head/title/text()').extract()
    item['url'] = response.url
    return self.parse(response) 

It seems to work, but it doesn't scrape anything, even if I add a parse function to my spider. Does anybody know how to use start_requests and rules together? I would be glad for any information about this topic. Happy coding!

NashGC
  • somebody, any ideas? – NashGC Jun 23 '19 at 17:10
  • I asked a similar question last week, but couldn't find a way either. This was the question https://stackoverflow.com/questions/56616527/scrapy-linkextractor-in-control-flow-and-why-it-doesnt-work – gunesevitan Jun 23 '19 at 19:03
  • @gunesevitan, have you seen this [answer](https://stackoverflow.com/questions/38280133/scrapy-rules-not-working-when-process-request-and-callback-parameter-are-set)? It gets my Rules working, but it doesn't crawl anything because the parse function is empty. If I redefine the parse function it still doesn't work. – NashGC Jun 24 '19 at 05:59

3 Answers


I found a solution. Frankly speaking, I don't know how it works, but it certainly does.

from scrapy import FormRequest, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TSpider(CrawlSpider):
    name = 't'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']
    login_page = 'https://books.toscrape.com'

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, errback=self.errback_httpbin, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(response)

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERRRRROR - {}'.format(failure))
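
My best guess at why it works: FormRequest.from_response(response) is yielded without an explicit callback, so its response falls back to CrawlSpider's default parse handling, and that is exactly the code that applies the rules. If that is right, the same effect should be possible without the login step. Here is a minimal, untested sketch of that idea (the spider name is a placeholder; errback stays a per-request argument, and meta could be passed the same way):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RulesWithErrbackSpider(CrawlSpider):
    name = 'rules_with_errback'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            # No callback here: the response goes to CrawlSpider's built-in
            # parse handling, which evaluates the rules, while errback
            # (and meta) can still be set per request.
            yield scrapy.Request(url, errback=self.errback_httpbin, dont_filter=True)

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//head/title/text()').extract()
        item['url'] = response.url
        yield item

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))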
NashGC

To catch errors from your rules you would need to define an errback for your Rule(), but unfortunately that is not possible at the moment.

You either need to extract and yield the requests yourself (that way you can attach an errback), or process each response using a middleware.
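
A minimal, untested sketch of the first option: a plain Spider that follows links through a LinkExtractor by hand, so every request can carry its own errback (and meta). The spider name, site and errback name are placeholders:

import scrapy
from scrapy.linkextractors import LinkExtractor


class ManualFollowSpider(scrapy.Spider):
    name = 'manual_follow'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    link_extractor = LinkExtractor()

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_item,
                                 errback=self.errback_httpbin, dont_filter=True)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'url': response.url,
        }
        # Follow links manually instead of relying on Rule(),
        # so each request gets an errback (and could carry meta too).
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_item,
                                 errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))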

gangabass

Here is a solution for handling errback with a LinkExtractor.

Thanks to this dude!
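
Since the code is not shown here, this is only a rough, untested sketch of one way to do it: the Rule's process_request hook can attach an errback to every request built from the extracted links. The spider name, site and attach_errback helper are placeholders, not part of the linked solution:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ErrbackRuleSpider(CrawlSpider):
    name = 'errback_rule'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com']

    rules = (
        # process_request lets us modify every request built from a rule,
        # which is one place to hang an errback (or meta) on it.
        Rule(LinkExtractor(), callback='parse_item', follow=True,
             process_request='attach_errback'),
    )

    def attach_errback(self, request, response=None):
        # Newer Scrapy versions also pass the response to process_request,
        # hence the optional second argument.
        return request.replace(errback=self.errback_httpbin)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//head/title/text()').extract(),
            'url': response.url,
        }

    def errback_httpbin(self, failure):
        self.logger.error('ERROR - {}'.format(failure))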

NashGC