3

Hi, I use the Nokogiri gem to scrape gem details from ruby-toolbox:

Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name"))

but I get the error: "403 Forbidden"

Can anyone tell me why I am getting this error?

Thanks in advance

pavel
Siva KB

4 Answers

7

Try to change your user-agent:

Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'firefox'))

www.ruby-toolbox.com doesn't seem to accept 'ruby' as an agent.
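As a rough sketch of the same idea on current Ruby versions, where open-uri is invoked through URI.open rather than a bare open: the helper name fetch_html and the User-Agent string below are made up for illustration, and whether a given site accepts that string is an assumption. The returned body can then be handed to Nokogiri::HTML.

```ruby
require 'open-uri'

# Hypothetical browser-like header value; any string the site accepts would do.
BROWSER_HEADERS = { 'User-Agent' => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)' }

# Returns the response body as a String, or nil if the server answers
# with an HTTP error status such as 403 Forbidden.
def fetch_html(url, headers = BROWSER_HEADERS)
  URI.open(url, headers, &:read)
rescue OpenURI::HTTPError => e
  # e.message carries the status line, e.g. "403 Forbidden"
  warn "Request failed: #{e.message}"
  nil
end
```

Calling fetch_html("https://www.ruby-toolbox.com/categories/by_name") and passing a non-nil result to Nokogiri::HTML would mirror the one-liner above.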

Oliver Zeyen
1

As mentioned, the user agent has to be changed. In addition, you have to disable SSL certificate verification, since the request would otherwise raise an SSL error as well.

require 'nokogiri'
require 'open-uri'
require 'openssl'

url = 'https://www.ruby-toolbox.com/categories/by_name'
content = open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE, 'User-Agent' => 'opera')
doc = Nokogiri::HTML(content)
doc.xpath('//div[@id="teaser"]//h2/text()').to_s
# "All Categories by name"
Daniël Knippers
  • 1
    It would be good for you to explain why disabling verification works, and why it's there in the first place, and what problems turning it off can cause. SSL without verification is crippled. – the Tin Man Jul 15 '14 at 20:12
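To illustrate the concern in the comment: with VERIFY_NONE the server's certificate is not checked at all, so the connection can be intercepted without any error being raised. A minimal sketch of an alternative that keeps verification on is shown here. It assumes the SSL error comes from a missing or outdated CA bundle on the client, which may not be the actual cause. The helper name open_uri_options and the ca_file parameter are hypothetical; :ssl_verify_mode and :ssl_ca_cert are option keys that open-uri accepts.

```ruby
require 'open-uri'
require 'openssl'

# Builds an options hash for URI.open that keeps certificate verification on.
# ca_file is an optional path to a CA bundle (hypothetical parameter name).
def open_uri_options(user_agent:, ca_file: nil)
  options = {
    'User-Agent' => user_agent,
    ssl_verify_mode: OpenSSL::SSL::VERIFY_PEER # verify the server certificate
  }
  options[:ssl_ca_cert] = ca_file if ca_file
  options
end
```

It could be used as URI.open(url, open_uri_options(user_agent: 'opera')), falling back to VERIFY_NONE only for throwaway experiments.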
0

This seems to be an OpenURI issue. Try this:

Nokogiri::HTML(open("https://www.ruby-toolbox.com/categories/by_name", 'User-Agent' => 'ruby'))
Community
dax
0

I spent about an hour trying solutions for a 403 Forbidden, including tinkering with the User-Agent argument, as in Nokogiri::HTML(open("www.something.com", "User-Agent" => "Safari")), looking into proxies, and other things.

But the whole time there was nothing wrong with my code. The website I had been browsing automatically had subtly changed its URL, and the URL my script previously visited was now forbidden.

I hope this may save someone else some time.
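One quick way to rule this out is to look at the raw status of the URL before touching the scraping code. A minimal sketch, where url_status is a made-up helper name: it issues a single GET without following redirects and reports the status code plus any Location header.

```ruby
require 'net/http'
require 'uri'

# Returns the HTTP status code (as a String) and, if present, the
# Location header, without following redirects.
def url_status(url)
  response = Net::HTTP.get_response(URI(url))
  { code: response.code, location: response['location'] }
end
```

A 301 or 302 with a new location, or a 403 on a URL that used to work, would suggest the address has changed rather than the code.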

stevec