IMPORTANT NOTE: all the answers currently available on Stack Overflow are for previous versions of Scrapy and don't work with the latest version, Scrapy 1.4.
I'm totally new to Scrapy and Python. I am trying to scrape some pages and download the images. The images are being downloaded, but they all keep their default SHA-1 hash as the filename, and I cannot figure out how to rename them.
I tried to rename them to "test", and "test" does appear in the output when I run scrapy crawl rambopics, along with the URL's data. But the files don't get renamed in the destination folder. Here is a sample of the output:
> 2017-06-11 00:27:06 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.theurl.com/> {'image_urls':
> ['https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg'],
> 'image_name': ['test'], 'title': ['test'], 'filename': ['test'],
> 'images': [{'url':
> 'https://www.theurl.com/-a4Bj-ENjHOY/VyE1mGuJyUI/EAAAAAAAHMk/mw1_H-mEAc0QQEwp9UkTipxNCVR-xdbcgCLcB/s1600/Image%2B%25286%2525.jpg',
> 'path': 'full/fcbec9bf940b48c248213abe5cd2fa1c690cb879.jpg',
> 'checksum': '7be30d939a7250cc318e6ef18a6b0981'}]}
So far I have tried many different solutions posted on Stack Overflow, but there is no clear answer to this question for the latest version of Scrapy in 2017; almost all of the suggestions appear to be outdated. I am using Scrapy 1.4 with Python 3.6.
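For context on the 'path' value in the output above: as far as can be told from the Scrapy source, the stock ImagesPipeline derives the name from the SHA-1 digest of the image URL, not from any item field. The following is a minimal stdlib-only sketch of that naming rule; the function name default_image_path and the example URL are hypothetical, and the real pipeline may differ in detail.

```python
import hashlib


def default_image_path(url):
    """Approximate how the stock ImagesPipeline appears to name a file."""
    # The name is the SHA-1 hex digest of the request URL (40 hex characters),
    # which is why item fields such as 'title' have no effect on it.
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


url = 'https://www.theurl.com/some/Image.jpg'  # hypothetical URL
print(default_image_path(url))
```

If this reading is right, changing the filename means overriding the method that builds this path rather than adding fields to the item.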
scrapy.cfg
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html
[settings]
default = rambopics.settings
[deploy]
#url = http://localhost:6800/
project = rambopics
items.py
import scrapy

class RambopicsItem(scrapy.Item):
    # defining items:
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()
    title = scrapy.Field()
    # pass -- don't really understand what pass is for
settings.py
BOT_NAME = 'rambopics'
SPIDER_MODULES = ['rambopics.spiders']
NEWSPIDER_MODULE = 'rambopics.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = "W:/scraped/"
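One thing that may matter here: the ITEM_PIPELINES entry above points at the stock ImagesPipeline, so a subclass defined in the project's pipelines.py would presumably never be called. A minimal sketch of the setting follows, assuming the subclass is named RambopicsPipeline and lives in rambopics/pipelines.py (adjust the dotted path if the project layout differs).

```python
# settings.py (sketch): enable the project's own pipeline instead of the stock one.
# The dotted path assumes the class is rambopics/pipelines.py:RambopicsPipeline.
ITEM_PIPELINES = {
    'rambopics.pipelines.RambopicsPipeline': 1,
}
IMAGES_STORE = "W:/scraped/"
```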
pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class RambopicsPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        img_url = item['img_url']
        meta = {
            'filename': item['title'],
            'title': item['image_name']
        }
        yield Request(url=img_url, meta=meta)
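Note that the spider in this question yields 'image_urls' (a list) and wraps 'title' in a list, while the pipeline above reads item['img_url'], a key the yielded item does not contain. The sketch below shows, in plain Python and under those assumptions, how each URL could be paired with a meta dict; media_request_args and the 'untitled' fallback are hypothetical names chosen for illustration.

```python
def media_request_args(item):
    """Yield (url, meta) pairs for each image URL in a scraped item."""
    # 'title' is a one-element list in the yielded item; 'untitled' is a
    # placeholder default used only in this sketch.
    titles = item.get('title') or ['untitled']
    for url in item.get('image_urls', []):
        if url:  # extract_first() can return None when nothing matches
            yield url, {'title': titles[0]}


item = {'image_urls': ['https://www.theurl.com/a.jpg', None], 'title': ['test']}
print(list(media_request_args(item)))
```

Inside get_media_requests, each pair could then be turned into scrapy.Request(url, meta=meta), which also avoids the bare Request name that is never imported above.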
(the spider) rambopics.py
from rambopics.items import RambopicsItem
from scrapy.selector import Selector
import scrapy

class RambopicsSpider(scrapy.Spider):
    name = 'rambopics'
    allowed_domains = ['theurl.com']
    start_urls = ['http://www.theurl.com/']

    def parse(self, response):
        for sel in response.xpath('/html'):
            #img_name = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_name = 'test'
            #img_title = sel.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            img_title = 'test'
            for elem in response.xpath("//div[contains(@class, 'entry-content')]"):
                img_url = elem.xpath("a/@href").extract_first()
                yield {
                    'image_urls': [img_url],
                    'image_name': [img_name],
                    'title': [img_title],
                    'filename': [img_name]
                }
Note: I don't know which meta key is the correct one to use for the final downloaded file name (I'm not sure whether it's filename, image_name, or title).
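On the meta key question: as far as the Scrapy 1.4 documentation indicates, none of these keys has a special meaning; meta is an arbitrary dict, and the stored name comes from the pipeline's file_path method. The following is a hedged sketch of the naming logic such an override could delegate to. build_image_path is a hypothetical helper, the 'title' meta key is an assumption, and the fixed .jpg extension assumes the images pipeline converts downloads to JPEG.

```python
import re


def build_image_path(title):
    """Build a relative storage path such as 'full/test.jpg' from a title."""
    # Replace anything that is not a word character or a dash, so the
    # result is safe to use as a filename; fall back to 'image' if empty.
    safe_title = re.sub(r'[^\w\-]+', '_', title).strip('_') or 'image'
    return 'full/%s.jpg' % safe_title


# In a subclass of ImagesPipeline the override might then look roughly like
# this (signature as documented for Scrapy 1.4; 'title' is whatever key was
# put into the request meta in get_media_requests):
#
#     def file_path(self, request, response=None, info=None):
#         return build_image_path(request.meta['title'])

print(build_image_path('test'))
```

With every title set to 'test', as in the spider above, all images would map to the same path and overwrite one another, so a real title or a counter would be needed to keep the files distinct.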