
I'm brand new to Python, so I apologize if there's a dumb mistake here. I've been scouring the web for days, looking at similar issues and combing through the Scrapy docs, and nothing seems to resolve this for me...

I have a Scrapy project which successfully scrapes the source website, returns the required items, and then uses an ImagesPipeline to download (and then rename accordingly) the images from the returned image links... but only when I run it from the terminal with "runspider".

Whenever I use "crawl" from the terminal or CrawlerProcess to run the spider from within the script, it returns the items but does not download the images and, I assume, completely skips the ImagesPipeline.

I read that I needed to import my settings when running this way in order to properly load the pipeline, which makes sense after looking into the differences between "crawl" and "runspider", but I still cannot get the pipeline working.

There are no error messages, but I notice that the log does show "[scrapy.middleware] INFO: Enabled item pipelines: []", which I assume means it is still missing my pipeline.
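
One way to check whether get_project_settings() is even finding the project settings is to print what it returns; a rough sketch only, and it assumes the script is run from the project root where scrapy.cfg lives:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# If settings.py was found, ITEM_PIPELINES should list Scrapy2Pipeline;
# an empty value means Scrapy fell back to its built-in defaults.
print(settings.get('BOT_NAME'))
print(settings.get('ITEM_PIPELINES'))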

Here's my spider.py:

import scrapy
from scrapy2.items import Scrapy2Item
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):

        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items

process = CrawlerProcess(get_project_settings())
process.crawl(spider1)
process.start()

Here is my items.py:

import scrapy

class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

Here is my pipelines.py:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

Here is my settings.py:

BOT_NAME = 'scrapy2'

SPIDER_MODULES = ['scrapy2.spiders']
NEWSPIDER_MODULE = 'scrapy2.spiders'

ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
   'scrapy2.pipelines.Scrapy2Pipeline': 1,
}

IMAGES_STORE = 'images'

Thank you to anybody that looks at this or even attempts to help me out. It's greatly appreciated.

tycrone
  • Is there any way you could simplify your code to produce a shorter [MRE](https://stackoverflow.com/help/minimal-reproducible-example)? – Geza Kerecsenyi Aug 22 '19 at 21:48
  • Sure thing, @GezaKerecsenyi. I've simplified my spider.py code and removed everything that didn't seem relevant to reproduce this scenario. – tycrone Aug 23 '19 at 18:49
  • thanks, it looks a lot nicer! – Geza Kerecsenyi Aug 23 '19 at 18:54
  • Have you checked the return value of `get_project_settings()`? Maybe your settings are not found. If that is the case, you can either find out why or define your settings in the spider itself using the `custom_settings` spider attribute. – Gallaecio Aug 26 '19 at 08:31
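
For reference, the `custom_settings` approach mentioned in the comment above could look roughly like this (a sketch only; the pipeline path assumes the original scrapy2 project layout):

import scrapy

class spider1(scrapy.Spider):
    name = "spider1"
    # Per-spider settings override the (possibly missing) project settings,
    # so the pipeline is enabled regardless of how the spider is launched.
    custom_settings = {
        'ITEM_PIPELINES': {
            'scrapy2.pipelines.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': 'images',
    }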

1 Answer


Since you are running your spider as a script, there is no Scrapy project environment, so get_project_settings() won't do anything useful (aside from grabbing the default settings). The script must be self-contained, i.e. contain everything you need to run your spider (or import it from your Python search path, like any regular old Python code).
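
If you do want to keep the project layout and have get_project_settings() pick up your settings.py, one option is to point Scrapy at the settings module explicitly before asking for the settings. This is an untested sketch and assumes the scrapy2 package is importable from wherever you run the script:

import os
# get_project_settings() reads the SCRAPY_SETTINGS_MODULE environment
# variable, so set it before the settings are loaded.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy2.settings')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# then: process.crawl(spider1); process.start() as before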

I've reformatted that code for you so that it runs when you execute it with the plain Python interpreter: python3 script.py.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'scrapy2'
ROBOTSTXT_OBEY = True
IMAGES_STORE = 'images'


class Scrapy2Item(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    sku = scrapy.Field()

class Scrapy2Pipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        return [scrapy.Request(x, meta={'image_name': item['sku']})
                for x in item.get('image_urls', [])]

    def file_path(self, request, response=None, info=None):
        return '%s.jpg' % request.meta['image_name']

class spider1(scrapy.Spider):
    name = "spider1"
    domain = "https://www.amazon.ca/s?k=821826022317"

    def start_requests(self):
        yield scrapy.Request(url=spider1.domain, callback=self.parse)

    def parse(self, response):

        items = Scrapy2Item()

        titlevar = response.css('span.a-text-normal ::text').extract_first()
        imgvar = [response.css('img ::attr(src)').extract_first()]
        skuvar = response.xpath('//meta[@name="keywords"]/@content')[0].extract()

        items['title'] = titlevar
        items['image_urls'] = imgvar
        items['sku'] = skuvar

        yield items

if __name__ == "__main__":
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    settings = Settings(values={
        'BOT_NAME': BOT_NAME,
        'ROBOTSTXT_OBEY': ROBOTSTXT_OBEY,
        'ITEM_PIPELINES': {
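            # the pipeline class is defined in this very script, so its
            # import path is relative to the __main__ module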
            '__main__.Scrapy2Pipeline': 1,
        },
        'IMAGES_STORE': IMAGES_STORE,
        'TELNETCONSOLE_ENABLED': False,
    })

    process = CrawlerProcess(settings=settings)
    process.crawl(spider1)
    process.start()
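
When run with python3 script.py, the downloaded images should end up as images/<sku>.jpg, since the path returned by file_path() is taken relative to IMAGES_STORE. Also make sure Pillow is installed, as the ImagesPipeline depends on it for image processing.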
nyov