I'm trying to use Scrapy over Tor, and I've been trying to work out how to write a download handler for Scrapy that uses SocksiPy connections.
Scrapy's HTTP11DownloadHandler is here: https://github.com/scrapy/scrapy/blob/master/scrapy/core/downloader/handlers/http11.py
Here is an example for creating a custom download handler: https://github.com/scrapinghub/scrapyjs/blob/master/scrapyjs/dhandler.py
Here's the code for creating a SocksiPyConnection class: http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
    import httplib
    import socks  # SocksiPy

    class SocksiPyConnection(httplib.HTTPConnection):
        def __init__(self, proxytype, proxyaddr, proxyport = None, rdns = True, username = None, password = None, *args, **kwargs):
            # Remember the proxy settings; everything else goes to HTTPConnection.
            self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
            httplib.HTTPConnection.__init__(self, *args, **kwargs)

        def connect(self):
            # Replace the plain socket with one that tunnels through the SOCKS proxy.
            self.sock = socks.socksocket()
            self.sock.setproxy(*self.proxyargs)
            if isinstance(self.timeout, float):
                self.sock.settimeout(self.timeout)
            self.sock.connect((self.host, self.port))
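For context, here is a minimal sketch of how the class above can be exercised on its own, as a blocking call outside Scrapy. It makes several assumptions that are not part of the original code: Python 3 (where `httplib` is named `http.client`), a SocksiPy-compatible `socks` module such as PySocks, and a Tor client listening on its default SOCKS port, 127.0.0.1:9050. The helper `fetch_via_tor` and its parameters are hypothetical names used only for illustration.

```python
import http.client  # Python 3 name for httplib (assumption: Python 3)


class SocksiPyConnection(http.client.HTTPConnection):
    """Same class as above, restated for Python 3."""

    def __init__(self, proxytype, proxyaddr, proxyport=None, rdns=True,
                 username=None, password=None, *args, **kwargs):
        self.proxyargs = (proxytype, proxyaddr, proxyport, rdns, username, password)
        http.client.HTTPConnection.__init__(self, *args, **kwargs)

    def connect(self):
        # Imported here so the class can be defined without the socks module.
        import socks
        self.sock = socks.socksocket()
        self.sock.setproxy(*self.proxyargs)
        if isinstance(self.timeout, float):
            self.sock.settimeout(self.timeout)
        self.sock.connect((self.host, self.port))


def fetch_via_tor(host, path="/", proxy_host="127.0.0.1", proxy_port=9050,
                  timeout=30.0):
    """Hypothetical helper: blocking GET through a local Tor SOCKS5 proxy."""
    import socks
    conn = SocksiPyConnection(socks.PROXY_TYPE_SOCKS5, proxy_host, proxy_port,
                              True, None, None, host, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

This works as a plain blocking request, which is exactly the part that does not fit into Scrapy's Twisted-based downloader as it stands.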
Given the complexity of the Twisted reactor code inside Scrapy, I can't figure out how to plug SocksiPy into it. Any thoughts?
Please do not answer with privoxy-like alternatives or post answers saying "scrapy doesn't work with socks proxies" - I know that, which is why I'm trying to write a custom Downloader that makes requests using socksipy.