I'm trying to set up image downloading from web pages using the Scrapy framework and django-item. I think I have done everything as described in the docs, but after calling scrapy crawl I get a log that looks like this:

Scrapy log

I can't find any information there about what went wrong, but the images field is empty and the directory does not contain any images.

This is my model:

from django.db import models


class Event(models.Model):
    title = models.CharField(max_length=100, blank=False)
    description = models.TextField(blank=True, null=True)
    event_location = models.CharField(max_length=100, blank=True, null=True)
    image_urls = models.CharField(max_length=200, blank=True, null=True)
    images = models.CharField(max_length=100, blank=True, null=True)
    url = models.URLField(max_length=200)

    # On Python 3, define __str__ instead.
    def __unicode__(self):
        return self.title

And this is how I pass the item from the spider to the image pipeline:

def parse_from_details_page(self, response):
    "Some code"
    item_event = item_loader.load_item()
    # Wrap the single image URL in a list (there is always exactly one).
    item_event['image_urls'] = [item_event['image_urls']]
    return item_event

And finally, this is my settings.py for the Scrapy project:

import sys
import os
import django

DJANGO_PROJECT_PATH = os.path.join(os.path.dirname((os.path.abspath(__file__))), 'MyScrapy')
#sys.path.insert(0, DJANGO_PROJECT_PATH)
#sys.path.append(DJANGO_PROJECT_PATH)
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "MyScrapy.settings")
#os.environ["DJANGO_SETTINGS_MODULE"] = "MyScrapy.settings"


django.setup()

BOT_NAME = 'EventScraper'

SPIDER_MODULES = ['EventScraper.spiders']
NEWSPIDER_MODULE = 'EventScraper.spiders'

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 100,
    'EventScraper.pipelines.EventscraperPipeline': 200,
}

#MEDIA STORAGE URL
IMAGES_STORE = os.path.join(DJANGO_PROJECT_PATH, "IMAGES")

#IMAGES (used to be sure that it takes good fields)
FILES_URLS_FIELD = 'image_urls'
FILES_RESULT_FIELD = 'images'
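
One thing worth checking here, as a hedged aside: `FILES_URLS_FIELD` and `FILES_RESULT_FIELD` are the settings read by `FilesPipeline`, while `ImagesPipeline` reads `IMAGES_URLS_FIELD` and `IMAGES_RESULT_FIELD`. The defaults are already `image_urls` and `images`, so this probably does not explain the empty field, but the equivalent settings for the image pipeline would look roughly like this:

```python
# settings.py fragment: field names read by ImagesPipeline.
# These values match Scrapy's defaults, so setting them is optional.
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'
```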

Thank you in advance for your help

EDIT:

I used the custom image pipeline from the docs, which looks like this:

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            import ipdb; ipdb.set_trace()  # debugging breakpoint
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        import ipdb; ipdb.set_trace()  # debugging breakpoint
        # Keep the storage path of every successfully downloaded image.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

In get_media_requests it creates a request for my URL, but in item_completed the results parameter contains something like this: [(False, <twisted.python.failure.Failure scrapy.pipelines.files.FileException: >)]. I still don't know how to fix it. Could the problem be caused by the image address using https?
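
To see why a download failed rather than only that it failed, the `results` list can be split into stored paths and error descriptions before the item is dropped. This is a minimal sketch, assuming each entry is an `(ok, value)` pair where `value` is a dict with a `path` key on success and a Twisted `Failure` on error; `split_results` is a hypothetical helper name, not part of Scrapy, and the sample data is a stand-in.

```python
def split_results(results):
    """Separate stored image paths from error descriptions.

    `results` is the list passed to item_completed: (ok, value) pairs.
    """
    paths, errors = [], []
    for ok, value in results:
        if ok:
            paths.append(value['path'])
        else:
            # A Twisted Failure keeps the original exception in `.value`;
            # fall back to the object itself for anything else.
            errors.append(repr(getattr(value, 'value', value)))
    return paths, errors


# Stand-in data for illustration; a real run would receive Failure objects.
paths, errors = split_results([
    (True, {'path': 'full/abc.jpg'}),
    (False, RuntimeError('download-error')),
])
print(paths)
print(errors)
```

Inside `item_completed`, logging `errors` before raising `DropItem` would show the underlying exception. Scrapy normally also logs a warning for each failed media download, so the crawl log around the image URL is worth checking as well.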

Max

1 Answer

I faced exactly the same issue with Scrapy. My solution:

I added headers to the request yielded in the get_media_requests function. I included a User-Agent and a Host header along with a few others. Here is my list of headers:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Proxy-Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Host': 'images.finishline.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

Open the exact image URL in your browser (the URL you are downloading the image from) and check the browser's network tab for the request headers. Make sure the headers you send with the request above match them.
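
As a rough sketch of how the headers could be attached, the helper below copies a base header set and derives `Host` from each image URL, so the hard-coded `images.finishline.com` does not have to be edited per site. `BASE_HEADERS` and `headers_for` are hypothetical names introduced for illustration, the example path `some/image.jpg` is made up, and the commented lines show where the helper would be called in `get_media_requests`, assuming the pipeline subclass from the question.

```python
from urllib.parse import urlparse

# Subset of the headers listed above; extend it to match what the browser sends.
BASE_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
    'User-Agent': ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'),
}


def headers_for(image_url):
    """Return a copy of BASE_HEADERS with Host set from the image URL."""
    headers = dict(BASE_HEADERS)
    headers['Host'] = urlparse(image_url).netloc
    return headers


# In the pipeline (requires Scrapy, so shown as comments here):
#     def get_media_requests(self, item, info):
#         for image_url in item['image_urls']:
#             yield scrapy.Request(image_url, headers=headers_for(image_url))

print(headers_for('https://images.finishline.com/some/image.jpg')['Host'])
```

Deriving `Host` per URL is a design choice for this sketch: it keeps the header consistent with the request when images come from more than one domain.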

Hope it works.

Alexandru Marculescu