The Splash browser does not send anything through the HTTP proxy. The pages are fetched even when the proxy is not running.
I am using Scrapy with Splash in Python 3 to fetch pages after authentication for an Angular.js website. The script is able to fetch pages, authenticate, and fetch pages after authentication. However, it does not use the proxy set up at localhost:8090, and Wireshark confirms that traffic coming from port 8050 goes to some port in the 50k range.
The setup is:
- Splash running locally in a Docker container (latest image) on port 8050
- Python 3 running locally on a Mac
- ZAP proxy running locally on the Mac at port 8090
- Web page accessed through a VPN
I have tried to specify the proxy host:port with a Lua script run through the Splash server UI in Chrome. The page is fetched without the proxy.
I have tried to specify the proxy in the Python script, both with Lua and with the API (args={'proxy': 'host:port'}), and the page is fetched without using the proxy.
I have tried using a proxy profile file and I get status 502.
- Proxy set through Lua on Chrome (no error, not proxied):
    function main(splash, args)
      splash:on_request(function(request)
        request:set_proxy{
          host = "127.0.0.1",
          port = 8090,
          username = "",
          password = "",
          type = "HTTP"
        }
      end)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end

    req = SplashRequest("http://mysite/home", self.log_in,
                        endpoint='execute', args={'lua_source': script})
- Proxy set through api (status 502):
    req = SplashRequest("http://mysite/home",
                        self.log_in, args={'proxy': 'http://127.0.0.1:8090'})
- Proxy set through Lua in Python (no error, not proxied):
    def start_requests(self):
        script = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(0.5))
          splash:on_request(function(request)
            request:set_proxy{
              host = "127.0.0.1",
              port = 8090,
              username = "",
              password = "",
              type = "HTTP"
            }
          end)
          return {
            html = splash:html(),
            png = splash:png(),
            har = splash:har(),
          }
        end
        """
        req = SplashRequest("http://mysite/home", self.log_in,
                            endpoint='execute', args={'lua_source': script})
        # req.meta['proxy'] = 'http://127.0.0.1:8090'
        yield req
- Proxy set through proxy file in docker image (status 502):
Proxy file:
    [proxy]
    ; required
    host=127.0.0.1
    port=8090
Shell command:
    docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles
All of the above should make the page show up in the ZAP proxy at port 8090.
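One detail worth ruling out in the Lua-in-Python attempt: there, splash:on_request is registered after splash:go has already loaded the page, so the callback can only see requests made after that point. The following is a minimal sketch, not a confirmed fix, that builds the script with the callback registered first. The helper name build_proxy_lua and its parameters are hypothetical; only splash:on_request, request:set_proxy, splash:go, splash:wait and splash:html come from the scripts above.

```python
def build_proxy_lua(host, port, proxy_type="HTTP"):
    """Return a Splash Lua script that sets the proxy before loading the page.

    Hypothetical helper for illustration; host and port are whatever
    address the Splash process can actually reach.
    """
    # Doubled braces are literal braces in the generated Lua source.
    return f"""
function main(splash, args)
  -- Register the callback first so it applies to the initial page load.
  splash:on_request(function(request)
    request:set_proxy{{
      host = "{host}",
      port = {int(port)},
      type = "{proxy_type}"
    }}
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {{ html = splash:html() }}
end
"""


script = build_proxy_lua("127.0.0.1", 8090)
```

The returned string would be passed as args={'lua_source': script} with endpoint='execute', in the same way as the requests above.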
Some of the above seem to set the proxy, but then localhost:8090 can't be reached (status 502). Others don't work at all (no error, not proxied). I think this may be related to the fact that a Docker image is being used.
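The Docker suspicion can be tested directly. Inside a container, 127.0.0.1 refers to the container itself, not to the Mac running ZAP. The sketch below rewrites a loopback proxy URL to a host alias before handing it to Splash. It assumes Docker Desktop for Mac, where the name host.docker.internal resolves to the host machine. The helper name proxy_url_for_container is hypothetical, and ZAP would also need to listen on an address the container can reach, which is not verified here.

```python
from urllib.parse import urlsplit, urlunsplit

# Addresses that point back at the container when used inside Docker.
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def proxy_url_for_container(proxy_url, host_alias="host.docker.internal"):
    """Replace a loopback host in proxy_url with a name for the Docker host.

    host_alias is an assumption: Docker Desktop for Mac provides
    host.docker.internal, other setups may need a different name or IP.
    """
    parts = urlsplit(proxy_url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return proxy_url  # already points somewhere other than loopback

    netloc = host_alias
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username:
        # Keep any credentials that were present in the original URL.
        userinfo = parts.username
        if parts.password:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


splash_args = {"proxy": proxy_url_for_container("http://127.0.0.1:8090")}
```

The resulting dict would be passed as args=splash_args to SplashRequest, like the API attempt above. If ZAP then shows the traffic, the 502 was a reachability problem and not a Splash configuration problem.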
I am not looking to use Selenium because that is what this is replacing.