
I can successfully access http pages through a proxy in Scrapy, but I cannot access https sites. I've researched the topic, but it's still unclear to me. Is it possible to access https pages through a proxy with Scrapy? Do I need to patch anything? Or add some custom code? If it can be confirmed this is a standard feature, I can follow up with more details. Hopefully this is an easy one.

Edited:

Here is what I added to the settings file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'test_website.middlewares.ProxyMiddleware': 100,
}
PROXIES = [{'ip_port': 'us-il.proxymesh.com:31280', 'user_pass': 'username:password'}]

Here is the code for my spider:

import scrapy

class TestSpider(scrapy.Spider):
    name = "test_spider"
    allowed_domains = ["ipify.org"]
    start_urls = ["https://api.ipify.org"]

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)

Here is the middlewares file:

import base64
import random
from settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass            
        else:
            request.meta['proxy'] = "http://%s" % proxy['ip_port']

Here is the log file:

2015-08-12 20:15:50 [scrapy] INFO: Scrapy 1.0.3 started (bot: test_website)
2015-08-12 20:15:50 [scrapy] INFO: Optional features available: ssl, http11
2015-08-12 20:15:50 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test_website.spiders', 'SPIDER_MODULES': ['test_website.spiders'], 'LOG_STDOUT': True, 'LOG_FILE': 'log.txt', 'BOT_NAME': 'test_website'}
2015-08-12 20:15:51 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-08-12 20:15:53 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-12 20:15:53 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-12 20:15:53 [scrapy] INFO: Enabled item pipelines: 
2015-08-12 20:15:53 [scrapy] INFO: Spider opened
2015-08-12 20:15:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-08-12 20:15:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 1 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Retrying <GET https://api.ipify.org> (failed 2 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] DEBUG: Gave up retrying <GET https://api.ipify.org> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] ERROR: Error downloading <GET https://api.ipify.org>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2015-08-12 20:15:53 [scrapy] INFO: Closing spider (finished)
2015-08-12 20:15:53 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 819,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 943000),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 8, 13, 2, 15, 53, 38000)}
2015-08-12 20:15:53 [scrapy] INFO: Spider closed (finished)

My spider works if I remove the 's' from 'https' or disable the proxy, and I can reach the https page through the proxy in my browser.

patrick.s

3 Answers


I got this error because I was using base64.encodestring instead of base64.b64encode in the proxy middleware. encodestring appends a trailing newline, which corrupts the Proxy-Authorization header; try changing it to b64encode.
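A minimal sketch of the difference (the credentials here are placeholders):

```python
import base64

user_pass = 'username:password'  # placeholder credentials

# base64.encodestring (Python 2; removed in Python 3.9) appends '\n':
#   base64.encodestring(user_pass) -> 'dXNlcm5hbWU6cGFzc3dvcmQ=\n'

# base64.b64encode produces a clean token with no trailing newline:
token = base64.b64encode(user_pass.encode('ascii')).decode('ascii')
print('Basic ' + token)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```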

Aminah Nuraini

I think it's possible.

If you're setting the proxy through Request.meta it should just work. If you're setting the proxy with the http_proxy environment variable, you might also need to set https_proxy.
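For the environment-variable case, a minimal sketch (the proxy address is the one from the question; substitute your own):

```python
import os

# Scrapy's HttpProxyMiddleware picks up proxies via urllib's getproxies(),
# which reads these variables; setting only http_proxy leaves https
# requests unproxied.
proxy = 'http://us-il.proxymesh.com:31280'
os.environ['http_proxy'] = proxy
os.environ['https_proxy'] = proxy
```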

It might be the case, however, that your proxy does not support HTTPS.

It would be easier to help you if you posted the error you are getting.

Artur Gaspar

Scrapy handles the https/SSL tunnelling automatically. As @Aminah Nuraini said, just use base64.b64encode instead of base64.encodestring in the proxy middleware:

  1. Add the following code in middlewares.py:

    import base64

    class ProxyMiddleware(object):
        # Override process_request
        def process_request(self, request, spider):
            # Set the location of the proxy (scheme included)
            request.meta['proxy'] = "http://<PROXY_SERVER>:<PROXY_PORT>"
            # Use the following lines if your proxy requires authentication
            proxy_user_pass = "<PROXY_USERNAME>:<PROXY_PASSWORD>"
            # Set up basic authentication for the proxy;
            # b64encode takes and returns bytes, so encode/decode around it
            encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

  2. Add the proxy middleware in settings.py:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
        '<YOUR_PROJECT>.middlewares.ProxyMiddleware': 100,
    }

Ken