I am using Net::HTTP for HTTP requests and getting a response back:

require 'net/http'

uri = URI("http://www.example.com")
# proxy_host and proxy_port are defined elsewhere
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request) # Net::HTTPResponse object
body = response.body

If I want to use the Nokogiri gem to parse this HTML response, I do:

require 'nokogiri'

nokogiri_obj = Nokogiri::HTML(body)
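
That gives me a Nokogiri::HTML::Document I can query, e.g.:

puts nokogiri_obj.at('h1').text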

But if I want to use the Mechanize gem, I need to do this:

require 'mechanize'

agent = Mechanize.new
mechanize_obj = agent.get("http://www.example.com")

Is it possible to use Net::HTTP to get the HTML response and then have the Mechanize gem convert it into a Mechanize object, instead of using agent.get()?


EDIT:

The reason for working around the agent.get() method is that I am trying to use EventMachine::Iterator to make concurrent EM-HTTP requests.

require 'eventmachine'
require 'em-http'
require 'mechanize'

agent = Mechanize.new
# urls is an array of URL strings defined elsewhere

EventMachine.run do
  EM::Iterator.new(urls, 3).each do |url, iter|
    puts "giving #{url} to httprequest now"
    http = EM::HttpRequest.new(url).get
    http.callback { |resp|
      uri = URI(url)
      puts "inside callback of #{url}"
      body = resp.response
      page = agent.parse(uri, resp, body) # raises the error below
    }
    iter.next
  end
end

But it's not working. I am getting an error:

/usr/local/rvm/gems/ruby-1.9.3-p194/gems/mechanize-2.5.1/lib/mechanize.rb:1165:in `parse': undefined method `[]' for #<EventMachine::HttpClient:0x0000001c18eb30> (NoMethodError)

When I use the parse method with a Net::HTTP response, it works fine and I get the Mechanize object:

uri = URI("http://www.example.com")
http = Net::HTTP.start(uri.host, uri.port, proxy_host, proxy_port)
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request) # Net::HTTPResponse object
body = response.body
agent = Mechanize.new
page = agent.parse(uri, response, body)

Am I passing the wrong arguments to the parse method when using em-http?

    Why would you want to do that? agent.get is so much simpler. – pguardiario Aug 21 '12 at 02:07
  • You are doing too much work. Mechanize will handle the `get` for you. Mechanize also uses Nokogiri internally for its parsing, so it's possible to get at Nokogiri's parsed doc and do additional lookups. – the Tin Man Aug 21 '12 at 06:02

2 Answers

I'm not sure why you think using Net::HTTP would be better. Mechanize will handle redirects and cookies, plus provides ready access to Nokogiri's parsed document.

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com')

# Use Nokogiri to find the content of the <h1> tag...
puts page.at('h1').content # => "Example Domain"

Note, setting the user_agent isn't necessary to reach example.com.
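
If you need the underlying Nokogiri document itself, page.parser returns it, so any Nokogiri lookup can be run directly:

doc = page.parser # the Nokogiri::HTML::Document behind the page
puts doc.at('title').text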


If you want to use a threaded engine to retrieve pages, take a look at Typhoeus and Hydra.
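
A rough sketch of that approach, assuming the urls array from the question (the title lookup is only for illustration, and max_concurrency: 3 mirrors the EM::Iterator concurrency above):

require 'typhoeus'
require 'nokogiri'

hydra = Typhoeus::Hydra.new(max_concurrency: 3)
urls.each do |url|
  request = Typhoeus::Request.new(url)
  request.on_complete do |response|
    # parse each page as soon as its request completes
    doc = Nokogiri::HTML(response.body)
    puts "#{url}: #{doc.at('title').text}"
  end
  hydra.queue(request)
end
hydra.run # blocks until every queued request has finished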

  • Yes, actually I am using Mechanize the same way later in the code to scrape the required data. But I was wondering if I could combine em-http with Mechanize as mentioned in the question. – HPC_wizard Aug 22 '12 at 09:44
  • I'd recommend using Typhoeus. See my additional comment in my answer. – the Tin Man Aug 22 '12 at 15:52

Looks like Mechanize has a parse method, so this could work:

agent = Mechanize.new
mechanize_obj = agent.parse(uri, response, body)
  • Thanks @Casper. Mechanize's parse method works correctly for Net::HTTP. How can I use the same for em-http? I think I am passing the wrong arguments to the parse method when using it with em-http. – HPC_wizard Aug 22 '12 at 09:47
  • @Gameboy I would post a new question for that issue. I'm not sure the response class of `em-http` is compatible with the `Net::HTTP` response, which is what `Mechanize` is expecting. You might need to monkey-patch something or convert the response to be compatible. – Casper Aug 22 '12 at 15:38
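
Following up on that last comment: the NoMethodError in the question suggests Mechanize#parse looks up headers by calling response['content-type'] on whatever response object it is handed, and EventMachine::HttpClient has no [] method. A minimal, untested sketch of such a conversion might wrap the em-http headers in a hash-like adapter (EmHttpResponseWrapper is a hypothetical name, and this assumes em-http keeps headers in response_header under upcased, underscored keys such as 'CONTENT_TYPE'):

# hypothetical adapter exposing the [] lookup Mechanize#parse performs
class EmHttpResponseWrapper
  def initialize(client)
    @client = client # the EventMachine::HttpClient yielded to the callback
  end

  # Mechanize asks for response['content-type'];
  # em-http stores it as response_header['CONTENT_TYPE']
  def [](key)
    @client.response_header[key.to_s.upcase.tr('-', '_')]
  end
end

Inside the em-http callback this would be used as:

page = agent.parse(URI(url), EmHttpResponseWrapper.new(resp), resp.response)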