I have built a scraper and would like to download some images through a proxy in Scrapy. I can't tell whether it is really downloading through the proxy: the response headers don't show the IP, and if I change the proxy to a random IP, it still downloads the image. How can I make sure it is using the proxy to download the images? Thanks

Pipelines.py

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        meta = {'proxy': 'http://23.323.44.22:11111/'}
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url, meta=meta)

Settings.py

ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
zer02

1 Answer

If the download works with a random IP, the proxy is not used.

The Scrapy documentation says: "You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port." Maybe the trailing '/' in your proxy URL confuses Scrapy?
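To rule out a malformed proxy URL before looking any further, something like the following could be used. It is only a sketch: `normalize_proxy` is a hypothetical helper, not part of Scrapy, and whether the trailing slash actually matters to Scrapy is a guess, not something confirmed here. It uses only the standard library, drops any trailing path, and rejects hosts that look like IPv4 addresses but are not valid ones.

```python
import ipaddress
from urllib.parse import urlsplit


def normalize_proxy(url):
    """Hypothetical helper: return 'scheme://host:port' or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    host = parts.hostname
    if not host:
        raise ValueError("proxy URL has no host")
    # If the host is written as dotted digits, it must be a valid IP address.
    if host.replace(".", "").isdigit():
        ipaddress.ip_address(host)  # raises ValueError for an octet above 255
    if parts.port is None:  # accessing .port also rejects out-of-range ports
        raise ValueError("proxy URL has no port")
    # netloc keeps any user:password@ part; the trailing path is dropped.
    return f"{parts.scheme}://{parts.netloc}"
```

With this, `normalize_proxy('http://23.323.44.22:11111/')` raises `ValueError`, because 323 is not a valid IPv4 octet. If the address in the question is the real one rather than a placeholder, that is worth fixing first.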

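To make sure that the proxy is used, I would run Wireshark with a filter on the proxy IP (for example, the display filter ip.addr == <proxy IP>). If you see traffic to and from that IP while the images download, the proxy is most likely being used.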
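If packet capture is not convenient, a different check (not the Wireshark approach above) is to point the proxy setting at a throwaway local "proxy" that only records what reaches it. The sketch below uses only the standard library; `LoggingProxy`, `start_logging_proxy` and `fetch_via` are made-up names, the server never forwards anything, and it only understands plain `http://` requests (HTTPS goes through `CONNECT`, which this sketch does not implement).

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoggingProxy(BaseHTTPRequestHandler):
    """Hypothetical stand-in proxy: records requested URLs, forwards nothing."""

    seen = []  # URLs of every GET request that reached this server

    def do_GET(self):
        # A client talking to an HTTP proxy sends the absolute URL as the path.
        LoggingProxy.seen.append(self.path)
        body = b"reached the logging proxy"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep stderr quiet; requests are recorded in `seen` instead


def start_logging_proxy(port=0):
    """Start the stand-in proxy on localhost (port 0 picks a free port)."""
    server = HTTPServer(("127.0.0.1", port), LoggingProxy)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_via(proxy_url, target_url):
    """Fetch target_url through proxy_url with urllib, as a quick self-check."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(target_url, timeout=5) as resp:
        return resp.read()
```

To try it with the pipeline from the question, start the server, set `meta = {'proxy': 'http://127.0.0.1:<port>'}` with the port from `server.server_address[1]`, run the spider, and inspect `LoggingProxy.seen`. If the image URLs show up there (and the downloads fail, since only a stub body is returned), the `proxy` meta key is being honoured. If the images still download normally and nothing is recorded, the requests are bypassing the proxy.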

rfelten