I have a working spider that scrapes image URLs and places them in the image_urls field of a scrapy.Item, and a custom pipeline that inherits from ImagesPipeline. When a specific URL returns a non-200 HTTP response code (say, a 401), I can see the failure in the log files. For instance:
WARNING:scrapy.pipelines.files:File (code: 404): Error downloading file from <GET http://a.espncdn.com/combiner/i%3Fimg%3D/i/headshots/tennis/players/full/425.png> referred in <None>
WARNING:scrapy.pipelines.files:File (code: 307): Error downloading file from <GET http://www.fansshare.com/photos/rogerfederer/federer-roger-federer-406468306.jpg> referred in <None>
However, I am unable to capture the error codes (404, 307, etc.) in the item_completed()
function of my custom image pipeline:
def item_completed(self, results, item, info):
    image_paths = []
    for download_status, x in results:
        if download_status:
            image_paths.append(x['path'])
            item['images'] = image_paths  # update item image path
            item['result_download_status'] = 1
        else:
            item['result_download_status'] = 0
            # x.printDetailedTraceback()
            logging.info(repr(x))  # x is a twisted Failure object
    return item
Digging through the Scrapy source code, inside the media_downloaded()
function in files.py, I found that for non-200 response codes a warning is logged (which explains the WARNING lines above) and then a FileException
is raised:
if response.status != 200:
    logger.warning(
        'File (code: %(status)s): Error downloading file from '
        '%(request)s referred in <%(referer)s>',
        {'status': response.status,
         'request': request, 'referer': referer},
        extra={'spider': info.spider}
    )
    raise FileException('download-error')
How can I also access this response code so that I can handle it in my pipeline's item_completed() function?
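For what it's worth, here is the direction I am considering, as a rough, untested sketch: raise an exception subclass that carries response.status from an overridden media_downloaded(), then read it back off the Twisted Failure in item_completed(). Note that StatusAwareFileException is my own invention, not part of Scrapy's API, and FileException is stubbed below only so the snippet runs standalone (in a real project it comes from scrapy.pipelines.files):

```python
class FileException(Exception):
    """Stub for scrapy.pipelines.files.FileException (standalone sketch)."""


class StatusAwareFileException(FileException):
    """My own subclass (not Scrapy API) that remembers the HTTP status."""

    def __init__(self, status):
        super().__init__('download-error')
        self.status = status  # e.g. 404 or 307


# In the real pipeline I would subclass ImagesPipeline and, inside an
# overridden media_downloaded(), do roughly:
#     if response.status != 200:
#         raise StatusAwareFileException(response.status)
# (exact signature and call site may vary by Scrapy version -- untested)


def item_completed(results, item):
    # Standalone version of the pipeline method (self/info dropped).
    # results is a list of (success, value) tuples; on failure, value is a
    # Twisted Failure whose .value is the exception raised during download.
    image_paths = []
    failed_codes = []
    for ok, x in results:
        if ok:
            image_paths.append(x['path'])
        else:
            exc = getattr(x, 'value', x)  # unwrap a Failure-like object
            if isinstance(exc, StatusAwareFileException):
                failed_codes.append(exc.status)
    item['images'] = image_paths
    item['result_download_status'] = 1 if image_paths else 0
    item['failed_status_codes'] = failed_codes
    return item
```

With this, item_completed() would see the status codes instead of only the opaque 'download-error' message, but I have not verified it against a live pipeline.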