I am using Ruby to scrape webpages that sometimes return redirects, which I want to follow. There are many Ruby gems that can do that, but there is a problem:

Ruby's `URI.parse` explodes on some URIs that are technically invalid but work in browsers, such as "http://www.google.com/?q=<>":

URI.parse("http://www.google.com/?q=<>")               #=> error

require 'addressable/uri'
Addressable::URI.parse("http://www.google.com/?q=<>")  #=> works

All the HTTP client libraries I have tried (HTTParty, Faraday, RestClient) break when they encounter such a URI in a redirect (this is on Ruby 1.9.3).

rest-client:

require 'rest-client'
RestClient.get("http://bitly.com/ReeuYv") #=> explodes

faraday:

require 'faraday'
require 'faraday_middleware'
conn = Faraday.new do |builder|
  builder.use FaradayMiddleware::FollowRedirects
  builder.adapter Faraday.default_adapter
end
conn.get("http://bitly.com/ReeuYv")       #=> explodes

httparty:

require 'httparty'
HTTParty.get("http://bitly.com/ReeuYv")   # => explodes

open-uri:

require 'open-uri'
open("http://bitly.com/ReeuYv")           # => explodes

What can I do to make this work?

levinalex
  • For what it's worth, `URI.parse` is actually just conforming to [RFC 3986](http://tools.ietf.org/html/rfc3986#page-13) in this case - `<` and `>` should be URL-encoded. Browsers are simply more forgiving. – Thilo Nov 06 '12 at 19:54
  • The assumption that a URI containing raw `<>` is valid is incorrect. – Mark Thomas Nov 06 '12 at 19:59
  • Okay, but still, it would be nice if this worked. (I've corrected the question.) – levinalex Nov 06 '12 at 20:00
  • A simple HTTP gem that is as forgiving as a browser's address bar would be a good thing to exist - there are numerous situations where it's valid for URL handling to match what people are used to in their day-to-day web usage. – robomc Oct 10 '13 at 23:14
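
As the comments point out, a standards-compliant parser is entitled to reject raw `<` and `>`; RFC 3986 expects them percent-encoded. A minimal sketch of that encoding step, using the stdlib's `URI.encode_www_form_component` (one of several ways to do it):

require 'uri'

# Percent-encode only the query value; "<>" becomes "%3C%3E".
query = URI.encode_www_form_component("<>")
URI.parse("http://www.google.com/?q=#{query}")  #=> parses without error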

4 Answers

Mechanize is my favourite web scraping gem.

The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, and can follow links and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.

require 'mechanize'
agent = Mechanize.new
page = agent.get('http://bitly.com/ReeuYv')
puts page.uri.to_s
=> http://www.google.com/?q=%3C%3E

It uses Nokogiri to parse the HTML, so every Mechanize::Page object can be treated like a Nokogiri object, and you can pull bits out of the HTML like this:

puts page.form('f').q
=> <>

The last part might seem like black magic, but you really need to try `pp page` yourself. It makes the HTML so easy to scrape.
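
For example, since the parsed document is exposed Nokogiri-style, CSS selectors should work directly on the page (the selectors below are just illustrative, not from Google's actual markup):

puts page.at('title').text                 # first node matching a CSS selector
page.search('a').map { |a| a['href'] }     # hrefs of every link on the page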

There's a guide to get you started, and the documentation.

sunnyrjuneja

Typhoeus works:

require 'typhoeus'
Typhoeus::VERSION #=> "0.5.0.rc" 
Typhoeus.get("http://bitly.com/ReeuYv", followlocation: true).body
levinalex

Curb seems to work:

require 'curb'
Curl.get("http://bitly.com/ReeuYv") { |c| 
  c.follow_location = true 
}.body_str  #=>  works
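
As with Typhoeus, the final URL should be recoverable from libcurl; a sketch assuming Curl::Easy's `last_effective_url`:

c = Curl::Easy.new("http://bitly.com/ReeuYv")
c.follow_location = true
c.perform
c.last_effective_url  #=> final URL after following the redirects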
levinalex

This will work:

require 'uri'

uri = URI.escape "http://www.google.com/?q=<>"
#=> "http://www.google.com/?q=%3C%3E"

URI.parse(uri)  #=> no error
Mark Thomas
  • Yes. But I don't get to escape the URI, because it is returned in a 302 response from some other server and handled deep inside whichever HTTP library I am using. (See the examples in the question; `http://bitly.com/ReeuYv` is a working URI that demonstrates the problem.) – levinalex Nov 06 '12 at 20:04