
I have a working spider that scrapes image URLs and places them in the image_urls field of a scrapy.Item, and a custom pipeline that inherits from ImagesPipeline. Some URLs return a non-200 HTTP response code (a 401 error, say), and the download fails. For instance, in the log files, I find:

WARNING:scrapy.pipelines.files:File (code: 404): Error downloading file from <GET http://a.espncdn.com/combiner/i%3Fimg%3D/i/headshots/tennis/players/full/425.png> referred in <None>
WARNING:scrapy.pipelines.files:File (code: 307): Error downloading file from <GET http://www.fansshare.com/photos/rogerfederer/federer-roger-federer-406468306.jpg> referred in <None>

However, I am unable to capture the error codes (404, 307, etc.) in the item_completed() function of my custom image pipeline:

def item_completed(self, results, item, info):

    image_paths = []
    for download_status, x in results:
        if download_status:
            image_paths.append(x['path'])
            item['images'] = image_paths  # update item image path
            item['result_download_status'] = 1
        else:
            item['result_download_status'] = 0
            #x.printDetailedTraceback()
            logging.info(repr(x)) # x is a twisted failure object

    return item

Digging through the scrapy source code, inside the media_downloaded() function in files.py, I found that for non-200 response codes, a warning is logged (which explains the above WARNING lines) and then a FileException is raised.

if response.status != 200:
    logger.warning(
        'File (code: %(status)s): Error downloading file from '
        '%(request)s referred in <%(referer)s>',
        {'status': response.status,
         'request': request, 'referer': referer},
        extra={'spider': info.spider}
    )

    raise FileException('download-error')
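
The reason the code is lost can be sketched in isolation. This is only an illustration under assumptions: FileException below is a stand-in class defined locally (the real one lives in scrapy.pipelines.files), and check_status and describe are hypothetical helpers that mirror the check above. The exception carries only the string 'download-error', so whatever receives it afterwards cannot tell a 404 from a 307.

```python
class FileException(Exception):
    """Stand-in for scrapy.pipelines.files.FileException, defined locally."""


def check_status(status):
    # Hypothetical helper mirroring the status check in media_downloaded().
    if status != 200:
        # Only a fixed string is attached; the status code is not passed on.
        raise FileException('download-error')
    return status


def describe(status):
    # Return the arguments carried by the exception, or () if none was raised.
    try:
        check_status(status)
    except FileException as exc:
        return exc.args
    return ()


# Both failures produce identical exception arguments.
print(describe(404))
print(describe(307))
```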

How can I access this response code so that I can handle it in my pipeline's item_completed() function?

hAcKnRoCk

2 Answers


If you are not familiar with asynchronous programming and Twisted callbacks and errbacks, the method chaining in Scrapy's media pipelines can easily be confusing. The essential idea in your case is to override media_downloaded so that it handles non-200 responses itself, like this (just a quick-and-dirty PoC):

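from scrapy.pipelines.images import ImagesPipeline


class MyPipeline(ImagesPipeline):

    def media_downloaded(self, response, request, info):
        if response.status != 200:
            # Return a dict instead of letting the parent raise FileException,
            # so the status code reaches item_completed().
            return {'url': request.url, 'status': response.status}
        # The parent's result dict must be returned, otherwise the result is None.
        return super(MyPipeline, self).media_downloaded(response, request, info)

    def item_completed(self, results, item, info):
        image_paths = []
        for download_status, x in results:
            if download_status:
                if not x.get('status', False):
                    # Successful download
                    image_paths.append(x['path'])
                else:
                    # x['status'] contains non-200 response code
                    item['result_download_status'] = x['status']
        item['images'] = image_paths
        return item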
mizhgun
  • Thanks for your answer. But in media_downloaded, the status code is always 200, since it gets called only if the download is successful (I suppose). In fact, I have tried a similar approach: I overrode file_downloaded() instead of media_downloaded(), since ImagesPipeline inherits from FilesPipeline, which defines this method. Please see my approach at http://pastebin.com/bpLKyWYx. However, I don't see non-200 status codes in item_completed(). I think that's because, as I mentioned in the question, a FileException is raised when non-200 status codes occur. – hAcKnRoCk Jan 20 '17 at 20:14
  • Actually, `media_downloaded` receives any response, not only 200. The code above overrides the default `media_downloaded` and checks whether the response is non-200. If so, it returns a dict with the response status; otherwise it calls the parent method of ImagesPipeline. The check therefore runs for every response **before** the exception would be raised. – mizhgun Jan 20 '17 at 20:38
  • Thanks for the guidance. I figured out that the best way was to call super for every response and catch the exception, instead of calling super only for 200 responses. I will post my approach as a separate answer, although your guidance was essential to proceed and figure out the answer. – hAcKnRoCk Jan 25 '17 at 16:59

The right way to capture non-200 response codes seems to be to override media_downloaded, call the parent method, and catch the exception it raises. Here is the code that works:

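import logging
from scrapy.pipelines.files import FileException

# Method of MyPipeline, a subclass of ImagesPipeline
def media_downloaded(self, response, request, info):
    try:
        # The parent returns a dict with 'url', 'path' and 'checksum' on success
        resultdict = super(MyPipeline, self).media_downloaded(response, request, info)
        resultdict['status'] = response.status
        logging.warning('No Exception : {}'.format(response.status))
        return resultdict
    except FileException as exc:
        # Swallow the exception and pass the status code on instead
        logging.warning('Caused Exception : {} {}'.format(response.status, str(exc)))
        return {'url': request.url, 'status': response.status}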

The response code can then be handled inside item_completed():

def item_completed(self, results, item, info):
    image_paths = []
    for download_status, x in results:
        if not download_status:
            continue  # x is a twisted Failure (e.g. a network error), not a dict
        item['result_download_status'] = x['status']  # 200 or the non-200 response code
        if x['status'] == 200:
            image_paths.append(x['path'])
    item['images'] = image_paths  # update item image paths
    return item
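
The bookkeeping in item_completed() can also be tried outside Scrapy. The sketch below is only an illustration: split_results is a hypothetical helper, and the sample data is hand-written (made-up URLs and paths), not real pipeline output. It assumes media_downloaded has been overridden as above, so that every successful entry is a dict with a 'status' key, while unsuccessful entries hold a failure object (a twisted Failure in a real run).

```python
def split_results(results):
    """Split item_completed()-style results into three groups.

    `results` is a list of (success, value) 2-tuples. Returns the stored
    paths, the (url, status) pairs for non-200 responses, and the failure
    objects for downloads that failed for other reasons.
    """
    paths, http_errors, failures = [], [], []
    for success, value in results:
        if not success:
            # Not a dict: keep the failure object for logging.
            failures.append(value)
        elif value.get('status') == 200:
            paths.append(value['path'])
        else:
            http_errors.append((value['url'], value['status']))
    return paths, http_errors, failures


# Hand-written sample data, shaped like the results described above.
sample = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg',
            'checksum': 'abc', 'status': 200}),
    (True, {'url': 'http://example.com/b.jpg', 'status': 404}),
    (False, RuntimeError('connection lost')),
]

paths, http_errors, failures = split_results(sample)
print(paths)
print(http_errors)
print(len(failures))
```

Keeping the three groups separate avoids overwriting a single status field when an item has several image URLs.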
hAcKnRoCk