I have an issue where my ImagesPipeline downloads some images while completely ignoring others. I test this by hardcoding the image URL with loader.add_value().

Here are two examples using the same image. Note that I only use one of these lines at a time, never both together.

# Test A, Works fine. Scrapy DOES download.
loader.add_value('image_urls', ['http://hemmon.com/house.jpg'])

# Test B, Not working. Scrapy does NOT download.
loader.add_value('image_urls', ['https://media.fastighetsbyran.se/23566167.jpg?Bredd=300'])

Test A is downloaded successfully; Test B is completely ignored. No debug messages, no errors, nothing. I run exactly the same settings.py with no other changes. The image file is the same: I downloaded it in the browser from the Test B URL and then uploaded it to my own website at the Test A URL, without changing the file itself.

Note that I also tried other files on the same host. All of them are ignored.

Here's my settings.py:

import os

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
IMAGES_STORE = os.path.join(BASE_DIR, 'images')
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

I found this post, which seems to describe a similar issue that was related to headers. That would explain why I can download the exact same image from one host but not from the other.

EDIT: I created a public repo that reproduces this issue.

Marcus Lind

1 Answer

Your issue is actually printed in the logs:

2017-09-25 22:53:17 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://media.fastighetsbyran.se/22943836.jpg>
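That log line means the request was dropped because of the host's robots.txt rules, before any image response could reach the pipeline. As a rough illustration of that kind of check, here is a minimal sketch using Python's standard-library parser. The helper name `is_allowed` and both robots.txt bodies are hypothetical, made up for this example and not copied from any real host, and Scrapy may be configured with a different parser than the one used here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "Scrapy") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies (assumptions for illustration only).
BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"

image_url = "https://media.fastighetsbyran.se/23566167.jpg?Bredd=300"

print(is_allowed(BLOCK_ALL, image_url))  # blocked under the first rule set
print(is_allowed(ALLOW_ALL, image_url))  # allowed under the second rule set
```

With a rule set like the first one, every URL on the host is rejected, which would match the observation that all files on that host are ignored.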

So the fix is simple: set ROBOTSTXT_OBEY = False in your settings.py, so that Scrapy stops filtering out requests that the host's robots.txt disallows.
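As a sketch, the change is a single line in settings.py. It assumes the setting is currently enabled somewhere in the project, which is typically the case for projects generated with `scrapy startproject`, since the generated settings.py turns it on.

```python
# settings.py
# Stop Scrapy's robots.txt middleware from dropping requests
# that the target host's robots.txt disallows.
ROBOTSTXT_OBEY = False
```

Keep in mind that this disables the robots.txt check for every request the project makes, not only for image downloads.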

Tarun Lalwani