I am trying to detect when there is a problem with the page I am scraping. If the response does not have a valid status code, I want to write a custom value to the crawler stats so that I can return a non-zero exit code from my process. This is what I have written so far:

MySpider.py

from scrapy import Spider

from spiders.utils.logging_utils import inform_user

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None
        }
    }

    def parse(self, response):
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)
        ...

utils/logging_utils.py

from scrapy.exceptions import CloseSpider

def inform_user(self, level, message, close_spider=False, reason=''):
    level = level.upper() if isinstance(level, str) else ''
    levels = {
        'CRITICAL': 50,
        'ERROR': 40,
        'WARNING': 30,
        'INFO': 20,
        'DEBUG': 10
    }
    self.logger.log(levels.get(level, 0), message)
    if close_spider:
        # flag the failure in the stats so the calling process can detect it
        self.crawler.stats.set_value('custom/failed_job', 'True')
        # CloseSpider is Scrapy's exception for stopping the crawl with a reason
        raise CloseSpider(reason=reason)

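For completeness, the script that launches the crawl is what turns that stat into the exit code. Roughly like this (a simplified sketch; the spiders.MySpider import path is only illustrative, and it assumes the spider is run through CrawlerProcess):

import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.MySpider import MySpider  # illustrative import path

process = CrawlerProcess(get_project_settings())
# keep a handle on the Crawler so its stats can be read after the run
crawler = process.create_crawler(MySpider)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes

# exit non-zero when the spider flagged a failure
if crawler.stats.get_value('custom/failed_job'):
    sys.exit(1)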
This works as expected. However, I don't think that removing the HttpErrorMiddleware is good practice, which is why I am trying to write a custom middleware that sets the stats on the crawler instead:

MySpider.py

from scrapy import Spider
from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user

class CustomHttpErrorMiddleware(HttpErrorMiddleware):    
    def process_spider_exception(self, response, exception, spider):
        super().process_spider_exception(response, exception, spider)

        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            reason = 'Status response not valid'
            inform_user(self, 'ERROR', message, close_spider=True, reason=reason)

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.mydomain.es']
    start_urls = ['http://www.mydomain/Download.html']
    custom_settings = {
        "SPIDER_MIDDLEWARES": {
            "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
            CustomHttpErrorMiddleware: 50
        }
    }

However, now I am calling the inform_user function inside the middleware definition, so I don't have access to the Spider self object, which holds the self.logger and self.crawler objects the function uses. How can I make that Spider self object available in the middleware?

Luiscri
    The spider `self` object is the argument named `spider` in the `process_spider_exception` method of the middleware. You can use it as `spider.logger.info(...)`. – msenior_ Mar 31 '22 at 02:14
  • I can't believe I didn't notice such an obvious thing, thank you very much. You can post the answer and I will mark it as valid. Also, could I know what the number given after the middleware (50 in my case) in the custom settings means? Is 50 a good value or should I use another one? @msenior_ – Luiscri Mar 31 '22 at 08:20
  • Answer posted. The priority number depends on the point at which you want your middleware to kick in compared to the other built in middlewares. – msenior_ Mar 31 '22 at 15:59
  • So a highest number means a higher priority, or is it the other way around? – Luiscri Mar 31 '22 at 16:25
  • Take a look at this [answer](https://stackoverflow.com/questions/6623470/scrapy-middleware-order). It explains the order in which middlewares are executed. It is in opposite order for requests and responses. – msenior_ Mar 31 '22 at 17:02

1 Answer

The spider `self` object is the argument named `spider` in the `process_spider_exception` method of the middleware. You can use it like `spider.logger.info(...)`.
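Applied to your middleware, that looks roughly like this (a sketch; it reuses your inform_user helper, which already takes the spider as its first positional argument):

from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware

from spiders.utils.logging_utils import inform_user

class CustomHttpErrorMiddleware(HttpErrorMiddleware):
    def process_spider_exception(self, response, exception, spider):
        # let the built-in middleware process the HttpError first
        result = super().process_spider_exception(response, exception, spider)
        if response.status != 200:
            message = "ERROR {} on request.".format(response.status)
            # `spider` is the running Spider instance, so it carries
            # .logger and .crawler just like `self` does inside the spider
            inform_user(spider, 'ERROR', message, close_spider=True,
                        reason='Status response not valid')
        return result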

msenior_