I am trying to crawl a website that is only accessible through a proxy. I have created a Scrapy project called scrapy_crawler. I have read that I need to enable the HttpProxyMiddleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}
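While reading about this I also noticed that HttpProxyMiddleware is enabled by default and, as far as I can tell, honours the standard proxy environment variables. So I wonder whether setting those at the top of settings.py would be enough on its own; this is only my guess, and the URL is a placeholder for my real proxy:

import os

# My assumption: HttpProxyMiddleware reads these environment variables
# when the crawler starts, so every request would be routed through
# the proxy without any per-request changes.
os.environ['http_proxy'] = 'http://username:password@myproxy:port'
os.environ['https_proxy'] = 'http://username:password@myproxy:port'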
I am a bit lost after that. I think I need to attach the proxy to each request, but I am not sure where that should happen. I have tried the following in the middlewares.py file:
def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    for r in start_requests:
        r.meta['proxy'] = 'http://username:password@myproxy:port'
        yield r
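Since process_start_requests is a spider middleware hook, I am not even sure my class gets picked up at all. I have also seen examples that set the proxy in a downloader middleware's process_request instead; here is a sketch of what I understand that to look like (ProxyDownloaderMiddleware is a name I made up, and the proxy URL is a placeholder):

class ProxyDownloaderMiddleware:
    # Called for every outgoing request; attaching the proxy here means
    # the built-in HttpProxyMiddleware (which runs later) can use it.
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://username:password@myproxy:port'
        return None  # let the request continue through the middleware chain

If I understand the priorities correctly, it would then be registered in settings.py with a number lower than the built-in middleware so that it runs first (scrapy_crawler.middlewares is where I assume the class would live, given my project name):

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawler.middlewares.ProxyDownloaderMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 100,
}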
And here is my spider, digtionary.py, for reference:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor()),)
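Alternatively, could I set the proxy directly in the spider by overriding start_requests? A minimal sketch of what I mean, with the same placeholder proxy details; I am not sure whether the requests generated by the Rule would inherit the meta:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ImdbCrawler(CrawlSpider):
    name = 'digtionary'
    allowed_domains = ['www.mywebsite.com']
    start_urls = ['https://mywebsite.com/digital/pages/start.aspx#']
    rules = (Rule(LinkExtractor()),)

    def start_requests(self):
        # Attach the proxy to each initial request explicitly.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'proxy': 'http://username:password@myproxy:port'})

I also saw that Rule accepts a process_request argument, which I assume could be used to set the same meta on the followed links, but I have not tried that.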
Any kind of help will be much appreciated. Thanks in advance.