
I'm coding web scrapers using Scrapy. A few sites I need to access require me to interact with them, so I'm making requests through Splash, which allows me to do that. This currently works just fine.

To prevent my scrapers from getting blocked, I want the requests to go through a pool of proxy servers, so I set up Scrapoxy for this.

The problem is that, to the best of my knowledge, the requests currently flow like this:

Scrapy -> Scrapoxy -> Splash -> Target Website

instead of:

Scrapy -> Splash -> Scrapoxy -> Target Website

Is it possible to fix this?
If not, is it possible to use any other headless browser or proxy IP rotator which can solve this issue?

  • Splash's `Request` has a [`set_proxy` method](https://splash.readthedocs.io/en/stable/scripting-request-object.html?#request-set-proxy). You could probably adapt [this Splash script](https://doc.scrapinghub.com/crawlera.html#using-crawlera-with-splash) used to integrate Splash and Crawlera. – paul trmbrth Feb 03 '17 at 11:16

1 Answer


You can use this script:

function main(splash)
    -- Scrapoxy endpoint (8888 is Scrapoxy's default proxy port)
    local host = "localhost"
    local port = 8888

    -- Route every request Splash makes through Scrapoxy,
    -- so the flow becomes Scrapy -> Splash -> Scrapoxy -> Target Website
    splash:on_request(function(request)
        request:set_proxy{host, port}
    end)

    splash:go(splash.args.url)
    return splash:png()
end
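
To wire this into Scrapy, the script is typically sent to Splash's execute endpoint. Here is a minimal, stdlib-only sketch of the request body (assumptions: Splash listens on localhost:8050 and Scrapoxy's proxy on localhost:8888; with the scrapy-splash plugin you would instead pass the same script via args={"lua_source": ...} on a SplashRequest with endpoint="execute"):

```python
import json

# The Lua script from the answer: route every outgoing Splash request
# through Scrapoxy before it leaves Splash.
LUA_SOURCE = """
function main(splash)
    splash:on_request(function(request)
        request:set_proxy{"localhost", 8888}
    end)
    splash:go(splash.args.url)
    return splash:png()
end
"""

def splash_execute_payload(url, lua_source=LUA_SOURCE):
    """Build the JSON body for a POST to http://localhost:8050/execute."""
    return json.dumps({"lua_source": lua_source, "url": url})

payload = splash_execute_payload("http://example.com")
```

Posting that payload to Splash's /execute endpoint (or yielding an equivalent SplashRequest from a spider) makes Splash, not Scrapy, the client of Scrapoxy, which is exactly the ordering the question asks for.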
  • I'm currently using something essentially similar. Is there any way to make middlewares work with this setup? I'm interested in the Scrapoxy Blacklist middleware. Also, you have built something amazing. Appreciate your effort. – John Sundharam Feb 15 '17 at 04:59
  • Hello John, I'm sure you can do that in Lua. Splash has an on_response event (see http://splash.readthedocs.io/en/stable/scripting-ref.html#splash-on-response), and you can make HTTP requests from Lua, so you could send an HTTP POST request to Scrapoxy. I'd be very interested if you find out how to do that; I'll add it to the Scrapoxy documentation! Fabien. – Fabien Vauchelles Feb 16 '17 at 13:58