
I am trying to crawl a website which is only accessible via a proxy. I have created a project called scrapy_crawler using Scrapy and the structure is as follows:

(screenshot of the project structure)

I have read that I need to enable the HttpProxyMiddleware in settings.py.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
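
From what I have read, this middleware looks for the proxy either in each request's meta['proxy'] key or in the standard proxy environment variables. If the environment-variable route works, I suppose something like this at the top of settings.py would be enough (the credentials and host are placeholders for my real ones), but I have not been able to confirm it:

import os

# Placeholder proxy URL - if I understand the docs correctly,
# HttpProxyMiddleware picks up these standard environment
# variables when the crawler starts.
os.environ['http_proxy'] = 'http://username:password@myproxy:port'
os.environ['https_proxy'] = 'http://username:password@myproxy:port'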

Beyond that I am a bit lost. I think I need to attach the proxy to each request, but I am not sure where the right place to do that is. I have tried the following in the middlewares.py file.

def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.

    # Must return only requests (not items).
    for r in start_requests:
        r.meta['proxy'] = 'http://username:password@myproxy:port'
        yield r
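
I have also wondered whether this belongs in a downloader middleware rather than a spider middleware, something along these lines (CustomProxyMiddleware is just a name I made up, the credentials are placeholders, and I am assuming the default project layout for the module path):

# middlewares.py - tag every outgoing request with the proxy
# before HttpProxyMiddleware handles it.
class CustomProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder proxy URL; my understanding is that the embedded
        # username:password ends up in a Proxy-Authorization header.
        request.meta['proxy'] = 'http://username:password@myproxy:port'

# settings.py - lower number so it runs before HttpProxyMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawler.middlewares.CustomProxyMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}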

And here is the digtionary.py file for reference.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor()),)
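
Another option I have considered is setting the proxy directly in the spider by overriding start_requests (placeholder credentials again), though I suspect the requests that the Rule generates would not inherit the meta, so this might only cover the first page:

import scrapy

class ImdbCrawler(CrawlSpider):
    # name, allowed_domains, start_urls and rules as above

    def start_requests(self):
        for url in self.start_urls:
            # meta['proxy'] is read per request by HttpProxyMiddleware
            yield scrapy.Request(
                url,
                meta={'proxy': 'http://username:password@myproxy:port'},
            )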

Any help would be much appreciated. Thanks in advance.

NShini
  • Does this answer your question? [Scrapy and proxies](https://stackoverflow.com/questions/4710483/scrapy-and-proxies) – gangabass Aug 27 '21 at 14:58
  • Hello, yes I have tried this but I forgot to mention that I need to pass a username and password too. Is there any way of doing that? Thanks – NShini Aug 27 '21 at 18:50
  • https://www.zyte.com/blog/scrapy-proxy/ if you need proxy authorization. – gangabass Aug 28 '21 at 01:31

0 Answers