I can successfully scrape building data from a website (www.propertyshark.com) for a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not disallowed by their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @key = key
    @link_key = link_key
    @result_hash = {} # collects data_name => data_value pairs
  end

  def crawl_propshark_single
    agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    # Search Google for the address, then follow the matching PropertyShark link
    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = @key
    page = agent.submit(form)

    page.links.each do |link|
      next unless link.text.include?(@link_key) && link.text.include?("PropertyShark")

      property_page = link.click
      next unless property_page

      data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
      data_name = property_page.css("div.cols").css("th")[4].text
      @result_hash[data_name] = data_value
    end
    @result_hash
  end
end

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key, key_link)
puts spider.crawl_propshark_single
I get the following error, but after an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
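As far as I can tell, the message means that the `[4]` lookup returned nil, i.e. the page that came back had fewer than five matching cells (possibly a block or captcha page rather than the property page, though that is only my guess). A minimal sketch of a nil-safe lookup, assuming a hypothetical helper `text_at` and a stand-in node struct so that it runs without any network access:

```ruby
# Hypothetical helper (not part of Mechanize or Nokogiri): returns the
# stripped text of the node at `index`, or nil when the node list is
# shorter than expected instead of raising NoMethodError.
def text_at(nodes, index)
  nodes[index]&.text&.strip
end

# Stand-in for parsed nodes; anything that responds to #text works here.
fake_node  = Struct.new(:text)
full_page  = Array.new(5) { |i| fake_node.new(" value #{i} ") }
block_page = [] # no matching cells at all

puts text_at(full_page, 4).inspect   # "value 4"
puts text_at(block_page, 4).inspect  # nil
```

This only turns the crash into a nil that can be checked; it does not explain why the expected cells are missing.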
When I run the above code in a loop, I delay the process with sleep 80 between addresses.
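For reference, the loop is roughly of the following shape. This is a simplified sketch rather than my exact code: `each_with_delay` is a hypothetical helper name, the `sleeper` parameter exists only so the pause can be stubbed out, and the block body stands in for the real `PropShark.new(key, key_link).crawl_propshark_single` call.

```ruby
# Hypothetical helper: calls the block for every item and pauses `delay`
# seconds between consecutive items (no pause before the first one).
# `sleeper` is injectable so the pause can be replaced in tests.
def each_with_delay(items, delay:, sleeper: Kernel.method(:sleep))
  results = []
  items.each_with_index do |item, i|
    sleeper.call(delay) if i > 0
    results << yield(item)
  end
  results
end

addresses = [
  ['41 coral St, Worcester, MA 01604 propertyshark', '41 Coral Street']
]

# In the real script the block would create a PropShark and crawl;
# a placeholder string keeps this sketch free of network calls.
out = each_with_delay(addresses, delay: 80) do |key, key_link|
  "would crawl #{key_link}"
end
puts out.inspect
```

With a single address no pause happens at all; with several, the 80-second pause falls between consecutive requests.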