Splash does not send anything through the HTTP proxy. The pages are fetched even when the proxy is not running.

I am using Scrapy with Splash in Python 3 to fetch pages after authentication for an Angular.js website. The script is able to fetch pages, authenticate, and fetch pages after authentication. However, it does not use the proxy set up at localhost:8090, and Wireshark confirms that traffic coming from port 8050 goes to some port in the 50k range.

The setup is:
- Splash running locally in Docker (latest image) on port 8050
- Python 3 running locally on a Mac
- ZAP proxy running locally on a Mac at port 8090
- Web page accessed through VPN

I have tried to specify the proxy host:port on the Splash server, using Chrome, with a Lua script. The page is fetched without the proxy.

I have tried to specify the proxy in the Python script, both with Lua and with the API (args={'proxy': 'host:port'}), and the page is fetched without using the proxy.

I have tried using a proxy profile file and I get status 502.

  1. Proxy set through Lua on Chrome (no error, not proxied):
function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = "127.0.0.1",
      port = 8090,
      username = "",
      password = "",
      type = "HTTP"
    }
  end
  )
  assert(splash:go(args.url))
  assert(splash:wait(0.5))

  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

req = SplashRequest("http://mysite/home", self.log_in,
                     endpoint='execute', args={'lua_source': script})
  2. Proxy set through API (status 502):
req = SplashRequest("http://mysite/home",
                            self.log_in, args={'proxy': 'http://127.0.0.1:8090'})
  3. Proxy set through Lua in Python (no error, not proxied):
def start_requests(self):
        script = """
            function main(splash, args)

                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                splash:on_request(function(request)
                    request:set_proxy{
                        host = "127.0.0.1",
                        port = 8090,
                        username = "",
                        password = "",
                        type = "HTTP"
                    }
                end
                )

                return {
                    html = splash:html(),
                    png = splash:png(),
                    har = splash:har(),
             }
            end
            """
        req = SplashRequest("http://mysite/home", self.log_in,
                            endpoint='execute', args={'lua_source': script})
        # req.meta['proxy'] = 'http://127.0.0.1:8090'
        yield req
  4. Proxy set through proxy file in Docker container (status 502):

Proxy file:

[proxy]

; required
host=127.0.0.1
port=8090

Shell command:

docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles

With any of the above, I expected the page request to show up in ZAP at port 8090.

Some of the above seem to set the proxy, but Splash then cannot reach the proxy at localhost:8090 (status 502). Some don't work at all (no error, not proxied). I think this may be related to the fact that Splash is running in Docker.

I am not looking to use Selenium because that is what this is replacing.

neoinageo

2 Answers

All methods that return status 502 are actually working: Splash applies the proxy setting but cannot reach the proxy. The reason is that a Docker container cannot access localhost on the host, because 127.0.0.1 inside the container refers to the container itself. To resolve this on a Mac host, use http://docker.for.mac.localhost:8090 as the proxy host:port. On Linux, start Splash with docker run -it --network host scrapinghub/splash and keep localhost:port. With --network host, -p has no effect, since all services in the container are already on the host's localhost.
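As a minimal sketch of how Method 2 could pick the right proxy address, the helper below builds the args dictionary for the request. The function names proxy_url_for_container and splash_args are hypothetical, and the Linux branch assumes Splash was started with --network host as described above.

```python
import platform


def proxy_url_for_container(port, system=None):
    """Return a proxy URL that a Splash container can use to reach the host.

    `system` defaults to the OS running this script, which is assumed to be
    the same machine that runs Docker and the proxy.
    """
    system = system or platform.system()
    if system == "Darwin":
        # Docker for Mac: special DNS name that resolves to the host.
        host = "docker.for.mac.localhost"
    else:
        # Assumption: Linux with `docker run --network host`, so the
        # container shares the host's localhost.
        host = "127.0.0.1"
    return f"http://{host}:{port}"


def splash_args(port, system=None):
    """Build the `args` dict to pass to SplashRequest (Method 2)."""
    return {"proxy": proxy_url_for_container(port, system)}
```

It would be used in place of the hard-coded address, for example SplashRequest("http://mysite/home", self.log_in, args=splash_args(8090)).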

Method 2 is best for a single proxy without rules. Method 4 is best for multiple proxies with rules.
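For Method 4, the proxy profile needs the same host change. The sketch below renders a profile with the [proxy] section from the question, using only the standard library; render_proxy_profile is a hypothetical helper, and the host value assumes a Mac host.

```python
import configparser
import io


def render_proxy_profile(host, port):
    """Render a Splash proxy profile with only the required [proxy] keys."""
    parser = configparser.ConfigParser()
    parser["proxy"] = {"host": host, "port": str(port)}
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the file in the question.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


# Assumption: Mac host, so the container reaches the proxy through this name.
profile_text = render_proxy_profile("docker.for.mac.localhost", 8090)
print(profile_text)
```

The resulting text would be saved as an .ini file in the directory mounted at /etc/splash/proxy-profiles.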

I did not test what the other methods return with these changes, or why.

neoinageo

Alright I have been struggling with the same problem for a while now, but I found the solution for your first method on GitHub, which is based on what the Docker docs state:

The host has a changing IP address (or none if you have no network access). From 18.03 onwards our recommendation is to connect to the special DNS name host.docker.internal, which resolves to the internal IP address used by the host. The gateway is also reachable as gateway.docker.internal.

This means you can use "host.docker.internal" as the host for your proxy instead, e.g.:

splash:on_request(function (request)
     request:set_proxy{
         host = "host.docker.internal",
         port = 8090
     }
end)

Here is the link to the explanation: https://github.com/scrapy-plugins/scrapy-splash/issues/99#issuecomment-386158523
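On the Python side, a sketch of the Lua script for the spider could look like the following. build_lua_script and LUA_TEMPLATE are hypothetical names. Note that the callback is registered before splash:go, on the assumption that on_request only affects requests made after it is registered; in the question's third method it is registered after the page has already loaded.

```python
LUA_TEMPLATE = """
function main(splash, args)
  -- Register the proxy callback first so it applies to the initial page load.
  splash:on_request(function(request)
    request:set_proxy{
      host = "%(host)s",
      port = %(port)d,
    }
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {html = splash:html()}
end
"""


def build_lua_script(host="host.docker.internal", port=8090):
    """Fill the proxy host and port into the Lua script template."""
    return LUA_TEMPLATE % {"host": host, "port": port}


script = build_lua_script()
```

The script string is then passed as before: SplashRequest("http://mysite/home", self.log_in, endpoint='execute', args={'lua_source': script}).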

Vasco