
When using Mechanize to pull some data from Craigslist, I keep getting the following error on Heroku: `status: Net::HTTPForbidden 1.1 403 Forbidden`

I am wondering what some ways are to prevent this from happening. My setup is below:

agent = Mechanize.new do |agent|
  agent.log              = @logger
  agent.user_agent_alias = 'Mac Safari'
  agent.robots           = false
end

Any ideas?

barnett
  • You have to figure out why they've forbidden it. 403 is just "No" with no real explanation. Try simplifying and use OpenURI to grab some pages and see what happens. Then try Mechanize with various user agent signatures. Or, contact them and ask them if they have an API. – the Tin Man Aug 27 '14 at 18:47
  • It could be based on geography, referer, cookies, or maybe you just hit them too hard. – pguardiario Aug 28 '14 at 00:20
  • Yeah, I was running a scrape every 10 minutes, which definitely would attract attention. Would there be a workaround, potentially changing the user agent? I tried dumping the cookies on each scrape but am still hitting 403 errors. – barnett Aug 28 '14 at 15:18

2 Answers


Figured I'd make this a bit cleaner. I had the same issue, which I was able to resolve by requesting new headers:

@agent = Mechanize.new { |agent| agent.user_agent_alias = 'Windows Chrome' }

@agent.request_headers
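
If a single alias keeps getting blocked, you can also give each fresh agent a different one. A minimal sketch, rotating between the aliases already mentioned in this thread (the fresh_agent helper is just an illustration):

require 'mechanize'

# Built-in user agent aliases; pick a different one for each new agent.
ALIASES = ['Windows Chrome', 'Mac Safari', 'Windows IE 7']

def fresh_agent
  Mechanize.new do |agent|
    agent.user_agent_alias = ALIASES.sample
    agent.robots           = false
  end
end

@agent = fresh_agent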

You should also include some error handling if you haven't already. I wrote the following to give an idea:

begin # beginning of block for handling rescue
  @results_page = @agent.get('the-webpage-i-wanted.com') # getting some page and doing cool stuff

  # The following line puts Mechanize to sleep for 1/10 second whenever a new
  # page is reached. This keeps you from overloading the site you're scraping
  # and minimizes the chance of getting errors. If you start to get '503'
  # errors you should increase this number a little!
  @agent.history_added = Proc.new { sleep 0.1 }
rescue Mechanize::ResponseCodeError => exception
  if exception.response_code == "503"
    @agent.history_added = Proc.new { sleep 0.2 }
    # The following line closes all active connections
    @agent.shutdown
    @agent = Mechanize.new { |agent| agent.user_agent_alias = 'Windows Chrome' }
    @agent.request_headers
    @page = @agent.get('the-webpage-i-wanted.com')
    @form = @page # ...getting back to where I was
    retry # retry (not redo) re-runs the begin block; redo only works in loops
  else
    # more error handling if needed
  end
end

**Note:** Consider running this as a background process to avoid timeout errors on Heroku, since they only allow a 15-30 second request-response cycle. I use RedisToGo (a Heroku add-on) and Sidekiq (a gem), if you're not doing that already!
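
A minimal sketch of that background-job approach, assuming Sidekiq is already configured with Redis (the worker name, URL, and parsing step are placeholders):

require 'sidekiq'
require 'mechanize'

# Hypothetical worker: the scrape runs outside the request-response cycle,
# so Heroku's router timeout never sees it.
class ScrapeWorker
  include Sidekiq::Worker
  sidekiq_options retry: 3

  def perform(url)
    agent = Mechanize.new { |a| a.user_agent_alias = 'Windows Chrome' }
    page  = agent.get(url)
    # ...parse and persist the results here...
  end
end

# Enqueue from a controller or a scheduled job:
# ScrapeWorker.perform_async('https://sfbay.craigslist.org/search/sfc/apa')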

bkunzi01
  • Can you post the example? – barnett Jun 17 '15 at 18:39
  • I forgot to add: `@agent.request_headers` after setting the alias. Also add rescues for errors you're sure to be getting: `rescue Mechanize::ResponseCodeError => exception` ... `if exception.response_code == "413"` – bkunzi01 Jun 18 '15 at 14:50

When working with Mechanize and other such browser emulators, you have to monitor your network; I prefer the Google Chrome developer tools.

Inspect your URL with a normal browser and check these:

  1. Is this URL valid?
  2. Is this URL public?
  3. Is this URL browser restricted?
  4. Is this URL secured by login?
  5. What parameters does this URL expect in normal conditions?

Debug these points, because the URL you are accessing may be restricted:

  • From public use
  • Maybe it is a directory path where indexing is not allowed
  • Maybe the server has restricted it for some user agents
  • Maybe you are not replicating the request completely

I guess I am using too many "maybe"s, but my point is that if you can't post your link publicly, I can only guess at your error. In case your link hits a directory directly and its indexing is off, then you can't browse it in Mechanize either. If it is restricted to specific user agents, then you should initialize your Mechanize with a specific user agent, like:

browser = Mechanize.new
browser.user_agent_alias = 'Windows IE 7'

In any other case you are not replicating your request: either some important parameters are missing, you are sending the wrong request type, or headers may be missing.
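
If headers are the problem, you can mirror what a normal browser sends by copying the real values from the dev tools network tab. A sketch, with placeholder values that you should replace with whatever your browser actually sends:

browser = Mechanize.new
browser.user_agent_alias = 'Mac Safari'

# Placeholder values: copy the real ones from your browser's network tab.
browser.request_headers = {
  'Accept'          => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.8',
  'Referer'         => 'https://sfbay.craigslist.org/'
}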

EDIT: Now that you've provided the link, here is what you should do when dealing with HTTPS:

Mechanize.new do |a|
  a.ssl_version = 'SSLv3'
  a.verify_mode = OpenSSL::SSL::VERIFY_NONE
end
user2009750
  • An example link is: https://sfbay.craigslist.org/search/sfc/apa?bedrooms=1&maxAsk=2600&minAsk=1400&nh=10&nh=11&nh=12&nh=149&nh=17&nh=18&nh=20&nh=21&nh=22&nh=23&nh=27&nh=30&sale_date=-&sort=date which works fine within the browser. I tried rotating user agents as well and was still hit with problems. – barnett Sep 11 '14 at 19:47
  • So I set this up: `Mechanize.new { |agent| agent.log = logger; agent.user_agent_alias = 'Mac Safari'; agent.robots = false; agent.ssl_version; agent.verify_mode = 'SSLv3'; OpenSSL::SSL::VERIFY_NONE }` Yet I don't understand why I am just printing the `ssl_version` and `OpenSSL::SSL::VERIFY_NONE` without setting them? When I tried using Mechanize with the above I kept getting `TypeError: no implicit conversion of String into Integer`. Any ideas? – barnett Sep 12 '14 at 18:21
  • I am not sure what you are trying to say; I checked the code snippet you gave and it worked here – user2009750 Sep 12 '14 at 20:47
  • It works locally now but still get the `403 Forbidden` Errors on Heroku. Any other ideas? – barnett Sep 16 '14 at 21:13
  • @bklane did you ever figure this out? I am running into the same problem with Kayak. – Ruby_Pry Apr 27 '15 at 18:09
  • @Ruby_Pry nope, I assume they use something like [Rack Attack](https://github.com/kickstarter/rack-attack), which will track IPs on requests and then return errors if there are too many requests in a certain time-frame. – barnett Apr 27 '15 at 18:14