
So I have a Nokogiri web scraper running perfectly on my local machine.

However, when I try to run the scraper in my production environment, I get a 403 error.

I believe this is because the website is blocking my server's IP address (probably because previous users of that IP got it blocked).

Is it possible to route the Nokogiri request from my web server through a proxy server? If so, how would I go about it?

This is the code I have at the moment:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open(URL_HERE,
  'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.854.0 Safari/535.2'))
  • Where are you getting the 403 from? From the website you're trying to scrape? – thesecretmaster Jun 21 '16 at 09:09
  • Indeed I am. I'm under the impression that they've blocked the server's IP address; that's why I thought of a proxy. – sam.roberts55 Jun 21 '16 at 09:33
  • Can you use Mechanize and configure a proxy for it? Look [here](http://stackoverflow.com/questions/18348673/how-do-i-configure-a-ruby-mechanize-agent-to-work-through-the-charles-web-proxy) or [here](https://gist.github.com/emergent/3983870); see the sketch after this comment thread. – Pavel Bulanov Jun 21 '16 at 09:42
  • I had a very quick scan read. Isn't the Charles proxy thing a desktop client? Thanks – sam.roberts55 Jun 21 '16 at 09:47
  • That's true for Charles, but it's just one example of a proxy, i.e. ("localhost", 8888) in the sample, which could be any proxy for your purpose. Actually, you can simply pass a proxy to the open method (see the answer below); it's just that I've always used Mechanize as a wrapper around Nokogiri. – Pavel Bulanov Jun 21 '16 at 09:53
  • Nokogiri has nothing to do with sending requests, so no, you can't use a proxy with it. `open` is being patched by `OpenURI` and that is what makes the web request and returns the 403 error. – the Tin Man Jun 22 '16 at 17:44
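
For reference, here is a minimal sketch of the Mechanize approach suggested in the comments. The proxy host, port, and URL below are placeholders, not values from the question.

require 'mechanize'

agent = Mechanize.new
agent.set_proxy('proxy.example.com', 8000) # placeholder proxy host and port
agent.user_agent_alias = 'Mac Safari'      # one of Mechanize's built-in UA aliases

page = agent.get('http://example.com/')    # returns a Mechanize::Page
doc  = page.parser                         # the underlying Nokogiri::HTML::Document

Mechanize handles the HTTP request (including the proxy) and hands back a parsed Nokogiri document, so the scraping code itself barely changes.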

1 Answer


Actually, you can simply use the `:proxy` option of the OpenURI `open` method.

open(*rest, &block)
#open provides `open' for URI::HTTP and URI::FTP.

...

The hash may include other options, where keys are symbols:
:proxy

Synopsis:    
:proxy => "http://proxy.foo.com:8000/"
:proxy => URI.parse("http://proxy.foo.com:8000/")

If :proxy option is specified, the value should be String, URI, boolean or nil.
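
Adapting the code from the question, it could look like this (a sketch; the URL and proxy address are placeholders for your own values):

require 'nokogiri'
require 'open-uri'

# Symbol keys (:proxy) are OpenURI options; string keys are sent as request headers.
doc = Nokogiri::HTML(open('http://example.com/page',
  :proxy       => 'http://proxy.foo.com:8000/', # placeholder proxy
  'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.854.0 Safari/535.2'))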

Also, as a general consideration (being tedious now), you should look for alternatives to scraping content, especially if it's done on a regular basis: things like a supported API or alternative sources. If your current server IP got blocked, the same can happen to the proxy.

Pavel Bulanov
  • Probably you won't get good _and_ free proxies. Free proxies work sporadically, stop working occasionally, and so forth. You can work with them, but not for something that should be reliable. For reliable proxies you should look at paid services; there are many, and I can't judge which ones are good or bad. – Pavel Bulanov Jun 21 '16 at 10:05
  • Yeah, I would prefer an API, but the API the web provider offers is either out of date or not updated alongside the website. – sam.roberts55 Jun 21 '16 at 10:34