
EDIT 2 - Because my folder names got mixed up, I accidentally posted the wrong code. Below is the accurate code for each file in the correct folder for this project.

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for pics project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'pics'

SPIDER_MODULES = ['pics.spiders']
NEWSPIDER_MODULE = 'pics.spiders'

IMAGES_STORE = 'W:/scrapy/scraped/'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pics (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'pics.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'pics.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'pics.pipelines.ImagesPipeline': 300,
}



# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Pipeline.py

import scrapy

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request
from PIL import Image


class PicsPipeline(object):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(url=img_url, meta=meta)


    def process_item(self, item, spider):
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item

class PicsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #  pass

    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_name = scrapy.Field()

blogspot.py

import scrapy

from scrapy.selector import Selector, HtmlXPathSelector
from pics.items import PicsItem
from PIL import Image


class BlogspotSpider(scrapy.Spider):
    name = "blogspot"
    allowed_domains = ['blogspot.fr']
    start_urls = ["http://10rambo.blogspot.fr/"]

    def parse(self, response):

        LOG_FILE = "spider.log"

        for sel in response.xpath('/html'):
            item = PicsItem()

        for elem in response.xpath("//div[contains(@class, 'hentry')]"):
            item['image_name'] = elem.xpath("//h3[contains(@class, 'entry-title')]/a/text()").extract()
            url = elem.xpath("//div[contains(@class, 'entry-content')]/a/@href").extract_first()
            item['image_urls'] = url

            yield scrapy.Request(response.urljoin(url), callback=self.parse, dont_filter=True)

This is what the log says at the moment:

2017-06-15 21:20:00 [scrapy] INFO: Scrapy 1.2.1 started (bot: pics)
2017-06-15 21:20:00 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pics.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['pics.spiders'], 'LOG_FILE': 'log.log', 'BOT_NAME': 'pics'}
2017-06-15 21:20:00 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-15 21:20:00 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-15 21:20:00 [scrapy] INFO: Enabled item pipelines:
['pics.pipelines.ImagesPipeline']
2017-06-15 21:20:00 [scrapy] INFO: Spider opened
2017-06-15 21:20:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-15 21:20:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-15 21:20:00 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET http://10rambo.blogspot.fr/> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (404) <GET https://2.bp.blogspot.com/robots.txt> (referer: None)
2017-06-15 21:20:01 [scrapy] DEBUG: Crawled (200) <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
2017-06-15 21:20:01 [scrapy] ERROR: Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg> (referer: http://10rambo.blogspot.fr/)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "W:\scrapy\pics\pics\spiders\blogspot.py", line 17, in parse
    for sel in response.xpath('/html'):
AttributeError: 'Response' object has no attribute 'xpath'
[The same DEBUG line and the identical traceback then repeat six more times for further responses to this image URL, one of them with referer http://12manrambotapes.blogspot.fr/.]
2017-06-15 21:20:02 [scrapy] INFO: Closing spider (finished)
2017-06-15 21:20:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3247,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 2120514,
 'downloader/response_count': 10,
 'downloader/response_status_count/200': 9,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 6, 15, 19, 20, 2, 80000),
 'log_count/DEBUG': 11,
 'log_count/ERROR': 7,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 10,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'spider_exceptions/AttributeError': 7,
 'start_time': datetime.datetime(2017, 6, 15, 19, 20, 0, 349000)}
2017-06-15 21:20:02 [scrapy] INFO: Spider closed (finished)
mlclm
  • You'll have to share your entire code and the entire error log, so we can check which line the error is on – eLRuLL Jun 14 '17 at 19:02
  • @eLRuLL OK, I have edited my post – mlclm Jun 14 '17 at 19:53
  • This is not the same error you shared before – eLRuLL Jun 14 '17 at 20:02
  • @eLRuLL Thanks for your reply. I made a mistake and posted the wrong code by accident. I have cleaned my files and edited my post above with the correct code. The error currently output is: `AttributeError: 'Response' object has no attribute 'xpath'` – mlclm Jun 15 '17 at 19:28
  • I updated my post and added the content of the items.py file. Would you have any info to help me out? – mlclm Jun 16 '17 at 04:42
  • Very strange for sure. I don't know if it could be related, but your pipeline can't `yield` Requests, so maybe remove the pipeline from settings and try again. Besides that, maybe your Scrapy installation is incorrect? Try reinstalling it or creating a separate virtual environment. Also make sure you have the `parsel` module installed and working. – eLRuLL Jun 16 '17 at 13:57

2 Answers


This part of the log states what's wrong:

File "W:\path\to\mypics\spiders\blogspot.py", line 29, in parse
  yield Request(response.urljoin(url), callback=self.parse)
NameError: global name 'Request' is not defined

It seems you forgot to specify the scrapy module for the Request class. This doesn't correspond to the code you posted, though: you have no blogspot.py, only spider.py, and inside its parse method you do qualify the Request class with the scrapy module correctly. On the other hand, it's missing in Pipeline.py, in the get_media_requests method:

yield Request(url=img_url, meta=meta)
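For what it's worth, the NameError comes from Python's import semantics: `import scrapy` binds only the module name, not the classes inside it. A minimal stand-in (using the stdlib json module in place of scrapy, so it runs anywhere) shows the same failure mode:

```python
import json  # stand-in for `import scrapy`: binds only the name `json`

try:
    JSONDecoder()  # the bare class name was never imported
except NameError as exc:
    error = str(exc)

print(error)  # name 'JSONDecoder' is not defined

decoder = json.JSONDecoder()  # the module-qualified name resolves fine
```

The same applies here: either write `scrapy.Request(...)` or import the class explicitly with `from scrapy.http import Request`.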
Tomáš Linhart
  • Thanks for your reply. I accidentally posted the wrong bits of code at first. I have edited my post with the correct code for my project. The error I'm facing is "AttributeError: 'Response' object has no attribute 'xpath'". I googled it but cannot find any real hints for resolving this issue. – mlclm Jun 15 '17 at 19:27
  • I updated my post and added the content of the items.py file which was missing. – mlclm Jun 16 '17 at 04:44

Spider error processing <GET https://2.bp.blogspot.com/-NTt9PYw8Ohw/V2mHJ-pakyI/AAAAAAAAHPw/o6I__73FpLoN2N_nTGnxCQqC4PwsLRrZQCLcB/s1600/Image%2B%252822%2529.jpg>

It's not HTML, it's a JPG file, so it has no xpath.
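This is also why Scrapy only attaches a selector to text responses; the content type drives the decision. A stdlib-only sketch of that kind of check (the function name is my own, not part of Scrapy's API):

```python
def looks_like_html(content_type):
    """Return True when a Content-Type header denotes an HTML document."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")

print(looks_like_html("text/html; charset=UTF-8"))  # True
print(looks_like_html("image/jpeg"))                # False: no markup, no xpath
```

In a spider, the simpler guard is to not schedule image URLs through parse at all, and instead put them in `item['image_urls']` so the images pipeline downloads them.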

Verz1Lka
  • I believe that xpath can target a jpg file url ( https://stackoverflow.com/questions/1179641/xpath-to-parse-src-from-img-tag ) – mlclm Jun 16 '17 at 11:41
  • You can extract a link to an image from an HTML response. But if you don't have an HTML response (it's a JPG), it doesn't have this method – Verz1Lka Jun 16 '17 at 11:57
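Both comments are right about where the URL lives: the link to a JPG sits in the HTML markup, and only the markup can be queried. A stdlib-only sketch (the sample markup and URL are made up for illustration):

```python
from html.parser import HTMLParser

class ImgSrcExtractor(HTMLParser):
    """Collects the src attribute of every <img> tag."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.srcs.extend(v for k, v in attrs if k == "src")

parser = ImgSrcExtractor()
parser.feed('<div class="entry-content"><img src="https://example.com/pic.jpg"></div>')
print(parser.srcs)  # ['https://example.com/pic.jpg']
```

Once extracted, that URL should be handed to the images pipeline for download, not fed back into parse as if it were another page.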