I have written my own Scrapy downloader middleware that checks the database for an existing request.url and, if it finds one, raises IgnoreRequest.

def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        sql = """SELECT url FROM domain_sold WHERE url = %s;"""

        try:

            cursor = spider.db_connection.cursor()
            cursor.execute(sql, (request.url,)) 

            is_seen = cursor.fetchone()
            cursor.close()
            if is_seen:
                raise IgnoreRequest('duplicate url {}'.format(request.url))

        except (Exception, psycopg2.DatabaseError) as error:
            self.logger.error(error)

        return None

If IgnoreRequest is raised, I expect the spider to move on to the next request, but in my case the spider still scrapes that request and passes the item through my custom pipeline.

My current setting for the downloader middleware is as follows:

'DOWNLOADER_MIDDLEWARES': {'realestate.middlewares.RealestateDownloaderMiddleware': 99}

Could anyone suggest why this is happening? Thanks.

1 Answer

IgnoreRequest inherits from the base Exception class, which you then immediately catch in your except clause and log, so it never propagates far enough to actually ignore the request.

Change:

except (Exception, psycopg2.DatabaseError) as error:

To:

except psycopg2.DatabaseError as error:
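
To see why the narrower except clause matters, here is a minimal sketch that runs without Scrapy or psycopg2 installed. The two exception classes are stand-ins I am assuming for scrapy.exceptions.IgnoreRequest and psycopg2.DatabaseError (both of which derive from Exception), and broad_handler / narrow_handler are hypothetical names used only for illustration.

```python
class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest (an Exception subclass)."""


class DatabaseError(Exception):
    """Stand-in for psycopg2.DatabaseError (also an Exception subclass)."""


def broad_handler(is_seen):
    # Mirrors the question: Exception in the tuple matches IgnoreRequest too,
    # so the raise below is caught and never reaches the caller.
    try:
        if is_seen:
            raise IgnoreRequest("duplicate url")
    except (Exception, DatabaseError) as error:
        return "swallowed: {}".format(error)
    return None


def narrow_handler(is_seen):
    # Mirrors the fix: only database errors are caught here,
    # so IgnoreRequest propagates to the caller.
    try:
        if is_seen:
            raise IgnoreRequest("duplicate url")
    except DatabaseError as error:
        return "swallowed: {}".format(error)
    return None
```

Calling broad_handler(True) returns normally because the exception was caught, whereas narrow_handler(True) lets IgnoreRequest escape, which is what Scrapy needs to see in order to drop the request.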
Jon Clements
  • This is correct, but a more concise answer would be to remove the try/except, since `process_request` should either return None, return a Response object, return a Request object, or raise IgnoreRequest (i.e. there is no need to catch the error). – wishmaster Mar 27 '20 at 03:09
  • @wishmaster That'll mean any DB exceptions would get loose and not explicitly get logged... looks like the above will always return None or raise IgnoreRequest anyway (failing any other exception potentially occurring), e.g. it looks like the OP wanted to log DB exceptions and not let 'em propagate but got a little overzealous with a rather broad-ranging `Exception` in their except clause – Jon Clements Mar 27 '20 at 03:13
  • @JonClements thank you. Your solution resolved the problem I had –  Mar 27 '20 at 21:53
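
One way to reconcile the two comments above is to keep the try block around the database call only and raise IgnoreRequest outside it, so DB errors are still logged but nothing can swallow the raise. The following is a rough sketch under assumptions, not the OP's actual middleware: the exception classes are stand-ins for scrapy.exceptions.IgnoreRequest and psycopg2.DatabaseError, url_already_seen is a hypothetical helper, and process_request takes a connection and URL directly instead of Scrapy's (request, spider) arguments so that it runs on its own.

```python
import logging


class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest."""


class DatabaseError(Exception):
    """Stand-in for psycopg2.DatabaseError."""


logger = logging.getLogger(__name__)


def url_already_seen(connection, url):
    """Return True if url is in domain_sold; log and return False on DB errors."""
    sql = "SELECT url FROM domain_sold WHERE url = %s;"
    try:
        cursor = connection.cursor()
        try:
            cursor.execute(sql, (url,))
            return cursor.fetchone() is not None
        finally:
            # Close the cursor whether or not the query succeeded.
            cursor.close()
    except DatabaseError as error:
        logger.error(error)
        return False


def process_request(connection, url):
    # The raise sits outside every try block, so it always propagates.
    if url_already_seen(connection, url):
        raise IgnoreRequest("duplicate url {}".format(url))
    return None
```

With this layout a database failure is treated as "not seen" and the request goes ahead, which matches the original code's behaviour of logging the error and returning None.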