
Without overriding the file_path method, the spider downloads all the images with the default 'request URL hash' filenames. However, when I try to override the method, it just doesn't work: nothing appears in the default output attribute, images.

I have tried both relative and absolute paths for the IMAGES_STORE variable in settings.py, as well as in the file_path method, to no avail. Even when I override file_path with a copy of the default implementation, the images do not download.
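(For context, the 'request URL hash' naming mentioned above is, as far as I understand Scrapy's stock behaviour, a SHA-1 hex digest of the request URL stored under a `full/` subfolder of IMAGES_STORE. A rough stand-alone sketch of that scheme, not Scrapy's actual implementation:)

```python
import hashlib

def default_image_path(url):
    # Approximates the stock ImagesPipeline naming: the filename is the
    # SHA-1 hex digest of the request URL, stored under 'full/'.
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid

print(default_image_path('https://www.example.com/fridge.jpg'))
```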

Any help would be much appreciated!

settings.py

BOT_NAME = 'HomeApp2'

SPIDER_MODULES = ['HomeApp2.spiders']
NEWSPIDER_MODULE = 'HomeApp2.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'

# ScrapySplash settings
SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        }
SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        }
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'HomeApp2.pipelines.DuplicatesPipeline': 250,
    'HomeApp2.pipelines.ProcessImagesPipeline': 251,
    'HomeApp2.pipelines.HomeApp2Pipeline': 300,
}

IMAGES_STORE = 'files'

pipelines.py

import json
import scrapy
from scrapy.exceptions import DropItem  
from scrapy.pipelines.images import ImagesPipeline

class DuplicatesPipeline(object):  
    def __init__(self): 
        self.sku_seen = set() 

    def process_item(self, item, spider): 
        if item['sku'] in self.sku_seen: 
            raise DropItem("Repeated item found: %s" % item) 
        else: 
            self.sku_seen.add(item['sku']) 
            return item

class ProcessImagesPipeline(ImagesPipeline):

    '''
    def file_path(self, request):
        print('!!!!!!!!!!!!!!!!!!!!!!!!!')
        sku = request.meta['sku']
        num = request.meta['num']
        return '%s/%s.jpg' % (sku, num)
    '''

    def get_media_requests(self, item, info):
        print('- - - - - - - - - - - - - - - - - -')
        sku = item['sku']
        for num, image_url in item['image_urls'].items():
            yield scrapy.Request(url=image_url, meta = {'sku': sku,
                                                        'num': num})

class HomeApp2Pipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

AppScrape2.py

import scrapy
from scrapy_splash import SplashRequest
from HomeApp2.items import HomeAppItem

class AppScrape2Spider(scrapy.Spider):
    name = 'AppScrape2'

    def start_requests(self):
        yield SplashRequest(
            url = 'https://www.appliancesonline.com.au/product/samsung-sr400lstc-400l-top-mount-fridge?sli_sku_jump=1',
            callback = self.parse,
        )

    def parse(self, response):

        item = HomeAppItem()

        breadcrumb = response.css('aol-breadcrumbs li:nth-last-of-type(1) .breadcrumb-link ::text').extract_first()
        if breadcrumb is None:
            # extract_first() can return None; rsplit() on None would raise
            return
        item['sku'] = breadcrumb.rsplit(' ', 1)[-1]
        item['image_urls'] = {}

        root_url = 'https://www.appliancesonline.com.au'
        product_picture_count = 0
        for pic in response.css('aol-product-media-gallery-main-image-portal img.image'):
            product_picture_count += 1
            item['image_urls']['p' + str(product_picture_count)] = (
                root_url + pic.css('::attr(src)').extract_first())

        feature_count = 0
        for feat in response.css('aol-product-features .feature'):
            feature_count += 1
            item['image_urls']['f' + str(feature_count)] = (
                root_url + feat.css('.feature-image ::attr(src)').extract_first())

        yield item

items.py

import scrapy

class HomeAppItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    sku = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()

  • Maybe it should be inherited from class `FilesPipeline` - I have old example code which changes filenames and it uses `FilesPipeline` - [Scrapy template code](https://github.com/furas/python-examples/blob/master/scrapy/__template__/python%20-%20scrapy.py) BTW: this code doesn't need project - it has all elements in one file and runs as standalone script. – furas Dec 07 '19 at 04:18
  • BTW in comments in my code I found that it has to send images as `{'image_urls': [url]}` so maybe problem is that you put urls in `['image_urls']['product']`. Or maybe something is changed in Scrapy for years. – furas Dec 07 '19 at 04:26
  • BTW: in code you could use `print()` or `logging` to see if it is executed. – furas Dec 07 '19 at 04:34
  • Did you check if `file_path` is ever called? What happens if it raises an exception? – Gallaecio Dec 10 '19 at 07:53
  • @furas I tried inheriting from FilesPipeline without success, same problem occurs. As for the nested dicts, I removed it and tried the code again, still doesn't work when I override the file_path method. – Isaac Ng Jan 10 '20 at 12:14
  • @Gallaecio I tried logging after overriding the file_path method. It seems it wasn't called at all. – Isaac Ng Jan 10 '20 at 12:15
  • If it is not called at all, you may be overriding what used to call it from the parent class, or you may be simply not causing the pipeline to be used at all, or you might not have properly enabled the pipeline. In any case, you may want to try to reproduce the issue with minimal code, and update your question accordingly, making it simpler, hence easier to answer. – Gallaecio Jan 10 '20 at 12:44
  • @Gallaecio Hopefully the code is easier to debug now. The problem still persist with the edited code. The pipeline is definitely used as the get_media_requests method is called. – Isaac Ng Jan 11 '20 at 10:44
  • @Gallaecio Thanks! Seems like it was an incorrect override of the method. – Isaac Ng Jan 11 '20 at 11:14

1 Answer


After much trial and error, I found the solution: it was simply a matter of adding the missing parameters to the file_path method.

Changing

def file_path(self, request):

to

def file_path(self, request, response=None, info=None):

It seems my original code overrode the method with the wrong signature, so Scrapy's calls to it failed.
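A stripped-down, Scrapy-free illustration of the failure mode (the class and method names here are hypothetical stand-ins, not Scrapy's internals): the framework invokes file_path with keyword arguments, so an override that doesn't accept them raises a TypeError, and in a media pipeline that kind of error can surface only as a silently failed download.

```python
class DefaultPipeline:
    # Stand-in for ImagesPipeline: the caller always passes the extra
    # keyword arguments when invoking file_path().
    def file_path(self, request, response=None, info=None):
        return 'full/default.jpg'

    def store_image(self, request):
        # Hypothetical caller mirroring how the framework calls file_path.
        return self.file_path(request, response=None, info=None)

class BrokenOverride(DefaultPipeline):
    def file_path(self, request):  # missing response/info parameters
        return '%s.jpg' % request

class FixedOverride(DefaultPipeline):
    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request

print(FixedOverride().store_image('SKU123'))   # works: SKU123.jpg
try:
    BrokenOverride().store_image('SKU123')
except TypeError as e:
    print('override failed:', e)               # unexpected keyword argument
```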

  • There was a bug in the documentation. [It should be fixed in Scrapy 2.0](https://github.com/scrapy/scrapy/pull/4290). – Gallaecio Feb 19 '20 at 10:32