I am using Ruby to scrape webpages that sometimes return redirects, which I want to follow. There are many Ruby gems that can do that, but there is a problem:

Ruby's `URI.parse` explodes on some URIs that are technically invalid but work in browsers, such as "http://www.google.com/?q=<>":

URI.parse("http://www.google.com/?q=<>")               #=> error

require 'addressable/uri'
Addressable::URI.parse("http://www.google.com/?q=<>")  #=> works

All the HTTP client libraries I have tried (HTTParty, Faraday, RestClient) break when they encounter such a URI in a redirect (this is on Ruby 1.9.3).

rest-client:

require 'rest-client'
RestClient.get("http://bitly.com/ReeuYv") #=> explodes

faraday:

require 'faraday'
require 'faraday_middleware'
conn = Faraday.new do |builder|
  builder.use FaradayMiddleware::FollowRedirects
  builder.adapter Faraday.default_adapter
end
conn.get("http://bitly.com/ReeuYv")       #=> explodes

httparty:

require 'httparty'
HTTParty.get("http://bitly.com/ReeuYv")   # => explodes

open-uri:

require 'open-uri'
open("http://bitly.com/ReeuYv")           # => explodes

What can I do to make this work?

levinalex
  • For what it's worth, `URI.parse` is actually just conforming to [RFC 3986](http://tools.ietf.org/html/rfc3986#page-13) in this case - `<` and `>` should be URL-encoded. Browsers are simply more forgiving. – Thilo Nov 06 '12 at 19:54
  • The assumption that a URI containing raw `<>` is valid is incorrect. – Mark Thomas Nov 06 '12 at 19:59
  • Okay, but still, it would be nice if this worked. (I've corrected the question.) – levinalex Nov 06 '12 at 20:00
  • A simple HTTP gem that is as forgiving as a browser's address bar would be a good thing to exist - there are numerous situations where it's valid for URL handling to match what people are used to in their day-to-day web usage. – robomc Oct 10 '13 at 23:14
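
As the comments point out, a standards-compliant parser is entitled to reject raw `<` and `>`; RFC 3986 expects them percent-encoded. A minimal sketch of that encoding step, using the stdlib's `URI.encode_www_form_component` (one of several ways to do it):

require 'uri'

# Percent-encode only the query value; "<>" becomes "%3C%3E".
query = URI.encode_www_form_component("<>")
URI.parse("http://www.google.com/?q=#{query}")  #=> parses without error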

4 Answers

Mechanize is my favourite web scraping gem.

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://bitly.com/ReeuYv')
puts page.uri.to_s
=> http://www.google.com/?q=%3C%3E

It uses Nokogiri to parse the HTML, so every Mechanize::Page object can be treated like a Nokogiri object, and you can pull bits out of the HTML like this:

puts page.form('f').q
=> <>

The last part might seem like black magic, but you really need to try `pp page` yourself. It makes the HTML so easy to scrape.
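
For example, since the parsed document is exposed Nokogiri-style, CSS selectors should work directly on the page (the selectors below are just illustrative, not from Google's actual markup):

puts page.at('title').text                 # first node matching a CSS selector
page.search('a').map { |a| a['href'] }     # hrefs of every link on the page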

There's a guide to get you started, and the documentation.

sunnyrjuneja

Typhoeus works:

require 'typhoeus'
Typhoeus::VERSION #=> "0.5.0.rc" 
Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true).body
levinalex

Curb seems to work:

require 'curb'
Curl.get("http://bitly.com/ReeuYv") { |c| 
  c.follow_location = true 
}.body_str  #=>  works
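
As with Typhoeus, the final URL should be recoverable from libcurl; a sketch assuming Curl::Easy's `last_effective_url`:

c = Curl::Easy.new("http://bitly.com/ReeuYv")
c.follow_location = true
c.perform
c.last_effective_url  #=> final URL after following the redirects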
levinalex

This will work:

require 'uri'

uri = URI.escape "http://www.google.com/?q=<>"
#=> "http://www.google.com/?q=%3C%3E"

URI.parse(uri)  #=> no error
Mark Thomas
  • Yes. But I don't get to escape the URI, because it is returned in a 302 response from some other server and handled deep inside whichever HTTP library I am using. (See the examples in the question; `http://bitly.com/ReeuYv` is a working URI that demonstrates the problem.) – levinalex Nov 06 '12 at 20:04