I need to collect the "title" of every page on a site.
The site is protected by HTTP Basic Auth.
Without auth, I do the following:

require 'anemone'
Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.doc.at('title').inner_html rescue nil
  end
end

But I have a problem with HTTP Basic Auth.
How can I collect titles from a site that uses HTTP Basic Auth?
If I use Anemone.crawl("http://username:password@example.com/"), I get only the first page's title. The other links are in the plain http://example.com/ form, without credentials, so I receive a 401 error for them.

George Cummins
Sergey Blohin
  • Сергей, you should probably spell your name with Latin letters, "Sergey Blokhin". Otherwise people won't be able to type your name to mention you in a comment. Heck, they won't be able to read it even! :) – Sergio Tulentsev May 30 '13 at 21:24
  • @SergioTulentsev thank you, I have changed my display name. :) – Sergey Blohin May 30 '13 at 21:33

1 Answer

HTTP Basic Auth works via HTTP headers. A client that wants to access a restricted resource must send an authentication header like this one:

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

It contains the username and password, joined by a colon and Base64-encoded. More info is in the Wikipedia article: Basic Access Authentication.
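To make the encoding concrete, here is a small sketch that builds the header value by hand and compares it with what Net::HTTP's own basic_auth helper produces. The method name basic_auth_header is made up for this illustration, and the credentials are the ones from the Wikipedia example.

```ruby
require 'net/http'

# Hypothetical helper for illustration: builds the value of the
# Authorization header from a username and a password.
def basic_auth_header(user, password)
  # pack('m0') is strict Base64 (no trailing newline or line breaks)
  'Basic ' + ["#{user}:#{password}"].pack('m0')
end

header = basic_auth_header('Aladdin', 'open sesame')
puts header # prints "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="

# Net::HTTP can set the same header on a request object for you.
req = Net::HTTP::Get.new('/')
req.basic_auth('Aladdin', 'open sesame')
puts req['Authorization'] == header # prints "true"
```

Note that Base64 is only an encoding, not encryption, so anyone who sees the header can recover the password.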

I googled a little bit and didn't find a way to make Anemone accept custom request headers. Maybe you'll have more luck.

But I found another crawler that claims it can do it: Messie. Maybe you should give it a try.

Update

Here's the place where Anemone sets its request headers: Anemone::HTTP. Indeed, there's no customization there. You can monkeypatch it. Something like this should work (put this somewhere in your app):

require 'anemone' # load Anemone first so the patch reopens the existing class

module Anemone
  class HTTP
    def get_response(url, referer = nil)
      full_path = url.query.nil? ? url.path : "#{url.path}?#{url.query}"

      opts = {}
      opts['User-Agent'] = user_agent if user_agent
      opts['Referer'] = referer.to_s if referer
      opts['Cookie'] = @cookie_store.to_s unless @cookie_store.empty? || (!accept_cookies? && @opts[:cookies].nil?)

      retries = 0
      begin
        start = Time.now
        # format request
        req = Net::HTTP::Get.new(full_path, opts)
        # HTTP Basic authentication: must be set before the request is sent
        req.basic_auth 'your username', 'your password' # <<== tweak here
        response = connection(url).request(req)
        finish = Time.now
        response_time = ((finish - start) * 1000).round
        @cookie_store.merge!(response['Set-Cookie']) if accept_cookies?
        return response, response_time
      rescue Timeout::Error, Net::HTTPBadResponse, EOFError => e
        puts e.inspect if verbose?
        refresh_connection(url)
        retries += 1
        retry unless retries > 3
      end
    end
  end
end

Obviously, you should provide your own values for the username and password params to the basic_auth method call. It's quick, dirty, and hardcoded, yes. But sometimes you don't have time to do things properly. :)
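If the hardcoded credentials bother you, one option is to read them from environment variables instead. This is only a sketch: the helper apply_basic_auth and the variable names ANEMONE_USER and ANEMONE_PASSWORD are my own assumptions and are not part of Anemone.

```ruby
require 'net/http'

# Hypothetical helper, not part of Anemone: adds an Authorization header
# to the request when both credentials are present in `env`.
def apply_basic_auth(req, env = ENV)
  user = env['ANEMONE_USER']         # assumed variable name
  password = env['ANEMONE_PASSWORD'] # assumed variable name
  req.basic_auth(user, password) if user && password
  req
end

# Demo with an explicit hash standing in for ENV.
demo_env = { 'ANEMONE_USER' => 'Aladdin', 'ANEMONE_PASSWORD' => 'open sesame' }
req = apply_basic_auth(Net::HTTP::Get.new('/'), demo_env)
puts req['Authorization']
```

In the patched get_response you would then call apply_basic_auth(req) in place of the req.basic_auth line, still before the request is sent.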

Sergio Tulentsev