Proxy servers with Scrapy-Splash

Question

I am trying to get proxy servers to work with my local Splash instance. I have read several documents but have not found a working example. It was pointed out to me that https://github.com/scrapy-plugins/scrapy-splash/issues/107 was the cause of an earlier traceback. I no longer get that traceback, but I still can't use Splash with proxies; the new error message is below. None of my requests are even making it through to Splash. Thanks in advance to anybody who can help me solve this.

    import json

    from scrapy_splash import SplashRequest

    # Lua script for Splash's execute endpoint: routes every browser request
    # through Crawlera and reuses the session ID that Crawlera returns.
    LUA_SOURCE = """
    function main(splash)
        local host = "proxy.crawlera.com"
        local port = 8010
        local user = "APIKEY"
        local password = ""
        local session_header = "X-Crawlera-Session"
        local session_id = "create"

        splash:on_request(function (request)
            request:set_header("X-Crawlera-UA", "desktop")
            request:set_header(session_header, session_id)
            request:set_proxy{host, port, username=user, password=password}
        end)

        splash:on_response_headers(function (response)
            if response.headers[session_header] ~= nil then
                session_id = response.headers[session_header]
            end
        end)

        splash:go(splash.args.url)
        return splash:html()
    end
    """

    # Spider method (enclosing class omitted).
    def parse_json(self, response):
        # The response body is JSON listing the blog topics to crawl.
        load = json.loads(response.body.decode('utf-8'))
        for topic in load['d']['blogtopics']:
            # Render each blog page through Splash with the Lua script above.
            yield SplashRequest(
                topic['Uri'],
                self.parse_blog,
                endpoint='execute',
                args={'wait': 3, 'lua_source': LUA_SOURCE},
            )


2017-03-29 09:26:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET http://community.martindale.com/legal-blogs/Practice_Areas/b/corporate__securities_law/archive/2011/08/11/sec-adopts-new-rules-replacing-credit-ratings-as-a-criterion-for-the-use-of-short-form-shelf-registration.aspx via http://localhost:8050/execute> (referer: None)

It looks like this open issue: https://github.com/scrapy-plugins/scrapy-splash/issues/107 — paul trmbrth, Mar 29 '17 at 10:33
You're right it was just confirmed a bug on my support ticket. Hopefully it gets fixed soon. I don't want to abandon splash. — eusid, Mar 29 '17 at 10:36
@eusid I think crawlera requires more custom splash code - can you check the example here https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash ? — Konstantin Lopuhin, Mar 29 '17 at 14:19
I started out with that solution but narrowed it down to this for the sake of simplicity. Do you know what is required? Setting the headers? — eusid, Mar 29 '17 at 14:20
Using that exact code with same results. Is there anybody that has this working ? 503 error. I really want to use splash but considering abandoning to just use regular webkit at this point. There has to be someone that knows how to make this work. — eusid, Mar 29 '17 at 14:28

score 2 · Accepted Answer · answered Mar 30 '17 at 03:02

The problem appears to be caused by the Crawlera middleware. It has no special handling for SplashRequest, so it tries to send the request to my local Splash instance (localhost) through the proxy.
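One way to act on this, as a minimal sketch: mark the Splash-bound requests so the Scrapy-level proxy middleware leaves them alone, and let the Lua script do the proxying inside Splash. The helper name `splash_request_kwargs` is hypothetical, and the `dont_proxy` meta key is an assumption about how the installed Crawlera middleware decides to skip a request, so check it against your version.

```python
def splash_request_kwargs(lua_source, wait=3):
    """Build keyword arguments for a SplashRequest that should reach the
    local Splash instance directly instead of going through the proxy."""
    return {
        "endpoint": "execute",
        "args": {"lua_source": lua_source, "wait": wait},
        # Assumption: the Crawlera middleware skips any request whose meta
        # contains 'dont_proxy'. Verify this against your installed version.
        "meta": {"dont_proxy": True},
    }


# Hypothetical usage inside the spider (requires scrapy-splash):
#     yield SplashRequest(link, self.parse_blog,
#                         **splash_request_kwargs(LUA_SOURCE))
```

If the middleware in use does not honour `dont_proxy`, the alternative is to disable it for the spider altogether and rely only on `request:set_proxy` in the Lua script.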

eusid


Also, to add a comment: the proxy worked on the main request but was failing on many of the browser's sub-requests. Since Crawlera is request-based, each page load is actually many requests. – eusid Jun 14 '19 at 06:20
