I'm starting out with Scrapy, and I've hit my first real problem: downloading pictures. This is my spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from example.items import ProductItem
from scrapy.utils.response import get_base_url

import re

class ProductSpider(CrawlSpider):
    name = "product"
    allowed_domains = ["domain.com"]
    start_urls = [
            "http://www.domain.com/category/supplies/accessories.do"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select('//td[@class="thumbtext"]')
        number = 0
        for site in sites:
            item = ProductItem()
            xpath = '//div[@class="thumb"]/img/@src'
            item['image_urls'] = site.select(xpath).extract()[number]
            item['image_urls'] = 'http://www.domain.com' + item['image_urls']
            items.append(item)
            number = number + 1
        return items

When I comment out ITEM_PIPELINES and IMAGES_STORE in settings.py, I get the proper URL for the picture I want to download (I copy-pasted it into a browser to check).

But when I uncomment them, I get the following error:

raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

and I can't download my pictures.

I've searched all day and haven't found anything helpful.

iblazevic
  • Do you have a pipeline to process the URLs? Did you register your pipeline in settings.py? http://doc.scrapy.org/en/latest/topics/images.html is a great reference. Do you have the proper permissions to write to the IMAGES_STORE path? – dm03514 Jan 08 '12 at 01:12
  • yes I did everything as it is said, actually I used that reference but still...no – iblazevic Jan 08 '12 at 19:28

2 Answers

I think the image URL you scraped is relative. To construct the absolute URL use urlparse.urljoin:

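import urlparse  # Python 2; on Python 3 use: from urllib.parse import urljoin

def parse(self, response):
    ...
    image_relative_url = hxs.select("...").extract()[0]
    # Resolve the scraped path against the page URL to get an absolute URL.
    image_absolute_url = urlparse.urljoin(response.url, image_relative_url.strip())
    # image_urls must be a list, even when there is only one URL.
    item['image_urls'] = [image_absolute_url]
    ...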
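To see what urljoin does with different kinds of src values, here is a small standalone sketch. It uses Python 3's urllib.parse (the snippet above is Python 2), and the helper name to_absolute and the image paths are made up for illustration; only the page URL comes from the question.

```python
from urllib.parse import urljoin


def to_absolute(page_url, src):
    """Resolve a scraped src value against the URL of the page it came from."""
    # strip() removes stray whitespace that often surrounds scraped attributes.
    return urljoin(page_url, src.strip())


page_url = "http://www.domain.com/category/supplies/accessories.do"

# Hypothetical src values, as they might appear in <img src="...">.
root_relative = to_absolute(page_url, " /images/thumb/159361.jpg ")
page_relative = to_absolute(page_url, "thumb/159361.jpg")
already_absolute = to_absolute(page_url, "http://cdn.domain.com/thumb/159361.jpg")

print(root_relative)     # http://www.domain.com/images/thumb/159361.jpg
print(page_relative)     # http://www.domain.com/category/supplies/thumb/159361.jpg
print(already_absolute)  # http://cdn.domain.com/thumb/159361.jpg
```

Unlike prepending 'http://www.domain.com' by hand, this also handles paths relative to the current page and leaves URLs that are already absolute untouched.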

I haven't used ITEM_PIPELINES, but the docs say:

In a Spider, you scrape an item and put the URLs of its images into a image_urls field.

So, item['image_urls'] should be a list of image URLs. But your code has:

item['image_urls'] = 'http://www.domain.com' + item['image_urls']

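So I guess it iterates over your single URL string character by character, treating each character as a URL. That would explain the "h" in the error message: it is the first character of "http://...".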
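The effect can be reproduced without Scrapy. The function below is only a simplified stand-in for a pipeline that loops over the image_urls field; it is not Scrapy's actual code, and the example URL is made up.

```python
def urls_seen_by_pipeline(image_urls):
    """Stand-in for a pipeline: collect each element it would turn into a request."""
    return [url for url in image_urls]


# A bare string is iterated character by character.
from_string = urls_seen_by_pipeline("http://www.domain.com/images/a.jpg")

# A list with one element yields the whole URL once.
from_list = urls_seen_by_pipeline(["http://www.domain.com/images/a.jpg"])

print(from_string[0])  # h
print(from_list[0])    # http://www.domain.com/images/a.jpg
```

With the string, the first "URL" is just "h", which has no scheme, and that matches the error in the question.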

warvariuc
  • This didn't help. As I said, I already get an absolute path my way; I tested the URL I get and it was indeed the URL of an image. I tried this and the result is the same as before: I get a good URL, but when I turn on ITEM_PIPELINES and IMAGES_STORE I get the same error – iblazevic Jan 08 '12 at 19:03
  • but this way of getting absolute url is definitely better, so thanks for that – iblazevic Jan 08 '12 at 19:04
  • @iblazevic, see my update. And don't forget to upvote/accept answers – warvariuc Jan 08 '12 at 19:55
  • edit file in the scrapy sources: `scrapy/scrapy/contrib/pipeline/images.py`, method `ImagesPipeline.get_media_requests`. Put there `print item.get('image_urls', [])` – warvariuc Jan 09 '12 at 11:24
  • you can even add a print right before where the exception is raised, to see _what_ url is invalid. > it's getting so frustrating < this is how you get experience :) – warvariuc Jan 09 '12 at 11:26
  • I don't think `u` is the problem. But why Error processing `{'image_urls': u'http ://www.reallygoodstuff.com/images/set_a/en_us/local/products/thumb/159361.jpg'}` and not `Error processing u'http ://www.reallygoodstuff.com/images/set_a/en_us/local/products/thumb/159361.jpg'`? Isn't item['image_urls'] in your case a list of dicts? I think you still must check whether item['image_urls'] is a list of URLs. – warvariuc Jan 09 '12 at 14:14
  • Upvoted, since this answer deserves some recognition: the question asker basically used your answer to solve their issue – Stephen Jan 10 '12 at 08:49

I think you may need to provide your image URL to the Item as a list:

item['image_urls'] = ['http://www.domain.com' + item['image_urls']]
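If the field is sometimes filled with a single string and sometimes with several URLs, a small guard can normalize it before the item is returned. This is just a sketch in Python 3; as_url_list is a hypothetical helper, not part of Scrapy, and the URLs are made up.

```python
def as_url_list(value):
    """Return value as a list of URLs, wrapping a bare string in a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = as_url_list("http://www.domain.com/images/a.jpg")
several = as_url_list(("http://www.domain.com/images/a.jpg",
                       "http://www.domain.com/images/b.jpg"))

print(single)   # ['http://www.domain.com/images/a.jpg']
print(several)  # ['http://www.domain.com/images/a.jpg', 'http://www.domain.com/images/b.jpg']
```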
alecxe
ddn