Scrapy ImagePipeline ignore image on specific host

Question

I have an issue where my ImagesPipeline downloads some images while completely ignoring others. I tested this by hardcoding the image URL with loader.add_value().

Here are two examples using the same image. Note that I only use one of these lines at a time, never both together.

# Test A, Works fine. Scrapy DOES download.
loader.add_value('image_urls', ['http://hemmon.com/house.jpg'])

# Test B, Not working. Scrapy does NOT download.
loader.add_value('image_urls', ['https://media.fastighetsbyran.se/23566167.jpg?Bredd=300'])

Test A is downloaded successfully; Test B is completely ignored. No debug messages, no errors, nothing. Both runs use exactly the same settings.py, with no other changes. The image file is identical: I downloaded it in the browser from the Test B URL and uploaded it unchanged to my own website at the Test A URL.

I also tried other files on the same host. All of them are ignored.

Here's my settings.py:

import os

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
IMAGES_STORE = os.path.join(BASE_DIR, 'images')
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

I found this post, which seems to describe a similar issue that turned out to be related to headers. That would explain why I can download the exact same image from one host but not from the other.

EDIT: I created a public repo that reproduces this issue.

Post a simple scraper code that can be used to test the issue — Tarun Lalwani, Sep 25 '17 at 14:34
@TarunLalwani I've created a public repo that demonstrate this here: https://github.com/marcuslind90/scrapy_error — Marcus Lind, Sep 25 '17 at 15:02
And if you remove the 's' from https, the same result ensues? — Uvar, Sep 25 '17 at 15:20

score 2 · Accepted Answer · answered Sep 25 '17 at 17:25

Your issue is actually printed in the logs:

2017-09-25 22:53:17 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://media.fastighetsbyran.se/22943836.jpg>

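So the fix is simple: set ROBOTSTXT_OBEY = False in your settings.py. The image host serves its own robots.txt, and Scrapy's robots.txt middleware drops any request that file disallows, which is why only this host is affected.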
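As a minimal sketch of that change, assuming the pipeline configuration from the question, the relevant part of settings.py could look like the fragment below. Projects generated by `scrapy startproject` typically ship with this setting switched to True, which would explain why the check was active without the asker having enabled it.

```python
# settings.py (fragment): pipeline entry copied from the question,
# plus the one line suggested in this answer.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Stop Scrapy from consulting each host's robots.txt before downloading.
# Note that this applies to every host the spider touches, not only the image server.
ROBOTSTXT_OBEY = False
```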

Tarun Lalwani

Oh, the file server had its own robots.txt file. Weird, thanks a lot. – Marcus Lind Sep 25 '17 at 23:12
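To see how a host's own robots.txt can block one URL while the same file on another host downloads fine, here is a small sketch using Python's standard `urllib.robotparser`. The helper name `is_allowed` and the robots.txt body are assumptions made up for illustration; this is not the actual file served by media.fastighetsbyran.se, and it is not how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that disallows everything for every crawler.
HYPOTHETICAL_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

# Host with a restrictive robots.txt (Test B in the question).
test_b_allowed = is_allowed(
    HYPOTHETICAL_ROBOTS_TXT,
    "Scrapy",
    "https://media.fastighetsbyran.se/23566167.jpg?Bredd=300",
)

# Host with an empty robots.txt, i.e. no rules at all (Test A in the question).
test_a_allowed = is_allowed("", "Scrapy", "http://hemmon.com/house.jpg")

print(test_b_allowed, test_a_allowed)
```

To check a live host instead of a hardcoded string, `RobotFileParser` can also be pointed at the host's `/robots.txt` URL with `set_url()` and loaded with `read()`.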
