
IMPORTANT NOTE: all the answers currently available on Stack Overflow target earlier versions of Scrapy and don't work with the latest version, Scrapy 1.4.

I'm totally new to Scrapy and Python. I am trying to scrape some pages and download the images. The images are downloaded, but they all keep their SHA-1 hash as the filename, and I cannot figure out how to rename them.

I tried to rename them to "test", and "test" does appear in the output when I run scrapy crawl rambopics, along with the URL data. But the files don't get renamed in the destination folder. Here is a sample of the output:

> 2017-06-11 00:27:06 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.theurl.com/> {'image_urls':
> ['https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg'],
> 'image_name': ['test'], 'title': ['test'], 'filename': ['test'],
> 'images': [{'url':
> 'https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg',
> 'path': 'full/fcbec9bf940b48c248213abe5cd2fa1c690cb879.jpg',
> 'checksum': '7be30d939a7250cc318e6ef18a6b0981'}]}

So far I have tried many different solutions posted on Stack Overflow, but there is no clear answer to this question for the latest version of Scrapy in 2017; almost all of the suggestions appear to be outdated. I am using Scrapy 1.4 with Python 3.6.

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = rambopics.settings

[deploy]
#url = http://localhost:6800/
project = rambopics

items.py

import scrapy

class RambopicsItem(scrapy.Item):
    # defining items:
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
    title = scrapy.Field()
    # pass -- don't really understand what pass is for

settings.py

BOT_NAME = 'rambopics'

SPIDER_MODULES = ['rambopics.spiders']
NEWSPIDER_MODULE = 'rambopics.spiders'


ROBOTSTXT_OBEY = True


ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = "W:/scraped/"

pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class RambopicsPipeline(ImagesPipeline):


    def get_media_requests(self, item, info):

        img_url = item['img_url']
        meta = {
                  'filename': item['title'],
                   'title': item['image_name']
                }

        yield Request(url=img_url, meta=meta)

(the spider) rambopics.py

from rambopics.items import RambopicsItem
from scrapy.selector import Selector
import scrapy


class RambopicsSpider(scrapy.Spider):
    name = 'rambopics'
    allowed_domains = ['theurl.com']
    start_urls = ['http://www.theurl.com/']

    def parse(self, response):

        for sel in response.xpath('/html'):
            #img_name = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_name = 'test'
            #img_title = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_title = 'test' 

        for elem in response.xpath("//div[contains(@class, 'entry-content')]"):
            img_url = elem.xpath("a/@href").extract_first()


            yield {
                   'image_urls': [img_url],
                   'image_name': [img_name],
                   'title': [img_title],
                   'filename': [img_name]
               }

Note: I don't know which meta key is the correct one for the final downloaded filename (I'm not sure whether it's filename, image_name, or title).

mlclm
  • This question is already answered on the site here: https://stackoverflow.com/a/30002870/1675954 and here: https://stackoverflow.com/a/6196180/1675954 Both are fairly comprehensive answers. Check that your base settings are configured properly; they need to be set before crawling. See https://doc.scrapy.org/en/latest/topics/api.html#scrapy.settings.BaseSettings.set – Rachel Gallen Jun 11 '17 at 08:27
  • Possible duplicate of [Renaming downloaded images in Scrapy 0.24 with content from an item field while avoiding filename conflicts?](https://stackoverflow.com/questions/29946989/renaming-downloaded-images-in-scrapy-0-24-with-content-from-an-item-field-while) – Rachel Gallen Jun 11 '17 at 08:29
  • Please explain in more detail what is not working with other solutions. Are there specific errors you get? – OneCricketeer Jun 11 '17 at 12:48

1 Answer


Override the file_path method to change the image names, passing the desired name to it through request.meta, as follows:

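import scrapy
from scrapy.pipelines.files import FilesPipeline


class SaveImagesPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        # Number the images so several URLs in one item get distinct names.
        for i, image_url in enumerate(item['image_urls'], start=1):
            # 'name_image' is the field name used in this answer; substitute
            # whichever item field holds the desired name (as a plain string).
            filename = '{}_{}.jpg'.format(item['name_image'], i)
            # Carry the target filename over to file_path() via request.meta.
            yield scrapy.Request(image_url, meta={'filename': filename})

    def file_path(self, request, response=None, info=None):
        # The returned path is relative to the pipeline's store directory.
        return request.meta['filename']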
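
Note that the override only takes effect if this class is the one registered in ITEM_PIPELINES. The settings.py in the question registers the stock scrapy.pipelines.images.ImagesPipeline, so a custom subclass would never run. A minimal sketch of the relevant settings follows; the dotted path is an assumption (it presumes the class is saved in rambopics/pipelines.py), and the store directory is copied from the question:

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Assumed location: rambopics/pipelines.py. Adjust to wherever the class lives.
    'rambopics.pipelines.SaveImagesPipeline': 1,
}

# FilesPipeline reads its target directory from FILES_STORE;
# a subclass of ImagesPipeline would read IMAGES_STORE instead.
FILES_STORE = 'W:/scraped/'
```

With FilesPipeline the download results are stored on the item under the files key rather than images, so an Item subclass would need a matching field.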
Verz1Lka
  • file_path is a method from Images Pipeline that we can override? I can't see that on official docs. – notGeek Nov 04 '18 at 13:16
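
One detail the answer's pipeline leaves open: the spider in the question wraps each field in a list (for example 'image_name': ['test']), and the answer hard-codes the .jpg extension. The sketch below shows one way to build the filename that get_media_requests could pass in request.meta. build_filename is a hypothetical helper, not part of Scrapy, and the sanitizing rule is an assumption about which characters are acceptable in a filename:

```python
import os
import re
from urllib.parse import unquote, urlparse


def build_filename(name, index, url, default_ext='.jpg'):
    """Hypothetical helper: return '<name>_<index><ext>' for one image URL."""
    # The spider in the question yields list values such as ['test'].
    if isinstance(name, (list, tuple)):
        name = name[0] if name else 'image'
    # Collapse anything other than letters, digits, '_' and '-' into '_'.
    safe = re.sub(r'[^\w\-]+', '_', str(name)).strip('_') or 'image'
    # Take the extension from the URL path; fall back to default_ext.
    ext = os.path.splitext(unquote(urlparse(url).path))[1].lower() or default_ext
    return '{}_{}{}'.format(safe, index, ext)


print(build_filename(['test'], 1, 'https://www.theurl.com/s1600/Image%2B%25286%2525.jpg'))
```

Inside get_media_requests this would replace the '{}_{}.jpg'.format(...) line, for example filename = build_filename(item['image_name'], i, image_url).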