
I am successfully scraping building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.

The code for a single run is as follows:

require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @key = key             # search query, e.g. "41 coral St, Worcester, MA 01604 propertyshark"
    @link_key = link_key   # text expected in the result link, e.g. "41 Coral Street"
    @result_hash = {}      # collected data, keyed by column header
  end

  def crawl_propshark_single
    agent = Mechanize.new do |a|
      a.user_agent_alias = 'Mac Safari'
    end
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    # Search Google for the address, then follow the PropertyShark result
    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = @key
    page = agent.submit(form)

    page.links.each do |link|
      next unless link.text.include?(@link_key)
      next unless link.text.include?("PropertyShark")

      property_page = link.click
      next unless property_page

      data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
      data_name  = property_page.css("div.cols").css("th")[4].text
      @result_hash[data_name] = data_value
    end

    @result_hash
  end
end # end of class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key,key_link)
puts spider.crawl_propshark_single

I get the following error, but in an hour or two it disappears:

undefined method `text' for nil:NilClass (NoMethodError)

When I loop over multiple addresses using the above code, I delay the process with sleep 80 between addresses.
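
The looping version looks roughly like the sketch below (the second address is just a placeholder; the real list is longer):

# Sketch of the looping run; addresses are placeholders
addresses = [
  ['41 coral St, Worcester, MA 01604 propertyshark', '41 Coral Street'],
  ['100 Grove St, Worcester, MA 01605 propertyshark', '100 Grove Street']
]

results = addresses.map do |key, key_link|
  spider = PropShark.new(key, key_link)
  data = spider.crawl_propshark_single
  sleep 80   # delay between addresses
  data
end
puts results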

Josh
  • You can use a proxy to access the website so that your IP address stays hidden – Cyzanfar Sep 06 '17 at 17:56
  • You can't scrape Google. The only way to do so is to use a pretty large batch of proxies which will be used in a random order alongside random user agents. Otherwise you will get reCaptcha very soon, and it's pretty hard to solve even using paid captcha-solving services. So I would advise you to change your algorithm to operate with the site itself, without Google in between, if possible. – nattfodd Sep 06 '17 at 18:23
  • Have you considered maybe they don't want you scraping the site? – tadman Sep 06 '17 at 18:32
  • @Cyzanfar You will need to renew the proxy settings constantly right? – Josh Sep 06 '17 at 18:44
  • @nattfodd Thanks for the input but it was impossible getting through the site directly and I had to resort to using Google as if I'm using their search engine. – Josh Sep 06 '17 at 18:54
  • Have you considered nothing is blocking you at all and that `property_page.css("div.cols").css("td.r_align")` sometimes does not have 5 elements? take this for instance `[1,2,3,4][4] #=> nil` – engineersmnky Sep 06 '17 at 20:20
  • @tadman I don't see how that's particularly relevant. – user428517 Sep 06 '17 at 20:52
  • @sgroves "Unauthorized use of a computer system" is a very vague term, and yet that sort of thing is *extremely illegal* in the United States, so I'd hate to be complicit in giving advice here that led to some kind of prosecution, no matter how likely that is. Don't think that people haven't been prosecuted for [precisely this thing](https://www.wired.com/2013/03/att-hacker-gets-3-years/). – tadman Sep 06 '17 at 21:14
  • @tadman This is a public website, though. – user428517 Sep 06 '17 at 21:17
  • @sgroves They don't care. I know that sounds insane, but that's how things are these days. You *probably* won't get caught, but if you do, this could be bad news without the right representation. – tadman Sep 06 '17 at 21:17
  • @tadman While I am inclined to agree that the government can be overzealous, exploiting a security vulnerability and web scraping otherwise publicly accessible pages are very different acts. There are numerous companies that derive their whole business model from internet-accessible data aggregation without prosecution. – engineersmnky Sep 07 '17 at 00:51
  • @engineersmnky What I mean is the definition of "public" does not always mean what you think it is. If you contravene the "acceptable use policy" you're in a grey zone. – tadman Sep 07 '17 at 01:09
  • Did you ever figure this out? – Brad Werth Oct 24 '17 at 15:53
  • @Brad Werth I had different solutions for different levels of blockage. Having a time lapse between requests, e.g., `time 5`, solved some websites. AWS proxy/IP addresses were being blocked, so I purchased 1000 proxy settings and randomly `set_proxy` each time, but the above propertyshark.com still didn't work. I was investigating your suggestion to use Selenium, but I got caught w/ having to deliver some data output, so I just ended up not using propertyshark.com for now. – Josh Oct 25 '17 at 17:25

2 Answers

4

The first thing you should do, before anything else, is contact the website owner(s). Right now, your actions could be interpreted as anything from overly aggressive to illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to depend on this website for your product, you may want to consider playing nice with them.

With that being said, you are moving through their website with all of the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural, human-emulating delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You might even consider using something like Selenium, which drives a real browser, instead of Mechanize, to give away fewer hints.
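
For illustration only, a randomized delay plus an honest user agent with Mechanize might look like this (the bot name and contact address are made up, and `addresses` stands in for your list):

agent = Mechanize.new do |a|
  # Identify yourself instead of pretending to be Safari
  a.user_agent = 'JoshPropertyBot/1.0 (contact: you@example.com)'
end

addresses.each do |address|
  # ... fetch and parse the page for this address ...
  sleep rand(60..180)   # irregular pause, rather than a fixed 80 seconds
end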

You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page, but some random error page. A simple retry may be all you need to get that data in question. When scraping, a poorly-functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
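
For example, a small retry wrapper around the failing lookup (purely illustrative, reusing `property_page` and `link` from the question's loop):

attempts = 0
begin
  cell = property_page.css("div.cols").css("td.r_align")[4]
  raise "expected cell not found" if cell.nil?
  data_value = cell.text
rescue => e
  attempts += 1
  if attempts < 3
    warn "Retrying after: #{e.message}"
    sleep 10
    property_page = link.click   # re-fetch the page before trying again
    retry
  else
    raise
  end
end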

If none of that works, you could consider setting up an elaborate array of proxies, but at that point you would be much better off using one of the many web-scraping / API-creation / data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
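
If you do end up rotating proxies, the Mechanize side of it is roughly this (the hosts and ports below are placeholders; you would need a real, working pool):

PROXIES = [
  ['203.0.113.10', 8080],
  ['203.0.113.11', 3128]
]   # placeholder hosts/ports

host, port = PROXIES.sample
agent.set_proxy(host, port)   # pick a different proxy before each request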

Brad Werth
  • If you're going to be doing a lot of this, you might check out https://www.nostarch.com/webbots2. It does a pretty good job of covering many of the issues you are running into. (But I would just subscribe to some 3rd-party API creation site and be done with this project..) – Brad Werth Sep 06 '17 at 20:21
  • @engineersmnky I do not believe this is an issue of getting one page. I believe this is a concentrated effort to scrape the entire contents of the site, likely on some kind of recurring schedule. – Brad Werth Sep 06 '17 at 20:29
  • Yes, Google does ask for permission, indirectly. It is super obvious, fully honors robots.txt directives, and uses an easily-identifiable user agent. Most people want it to consume their sites; those that do not have an easy remedy against it. The selector could be missing from a "Server down", "Rate limited", "Cloudflare protected", etc. page, in addition to (as you suppose) misshapen data. The legal advice, as well as the mention of an API or product feed, was only meant to illustrate that there may be simpler ways to solve this problem (or reasons not to) that the OP may not be aware of. – Brad Werth Sep 06 '17 at 20:53
  • From https://support.google.com/webmasters/answer/7424835?hl=en "When Googlebot visits a website, we first ask for permission to crawl by attempting to retrieve the robots.txt file" – Brad Werth Sep 06 '17 at 20:53
  • Yeah, it sounds insane. Then again, it is a public server... That's why the trespass to chattels stuff seems insane to me too, sort of, but then how do you draw the line between Josh and his spider and a legit DOS attack... – Brad Werth Sep 06 '17 at 21:01
2

It is very likely nothing is "blocking" you. As you pointed out

property_page.css("div.cols").css("td.r_align")[4].text

is the problem. So let's focus on that line of code for a second.

Say the first time around your columns are `columns = [1,2,3,4,5]`; then `columns[4]` will return `5` (the element at index 4).

Now, for fun, let's assume the next time around your columns are `columns = ['a','b','c','d']`; then `columns[4]` will return `nil` because there is nothing at index 4.

This appears to be your case: sometimes there are 5 columns and sometimes there are not, which leads to `nil.text` and the error you are receiving.
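
A simple guard around the lookup (a sketch built on the selectors from the question) avoids calling `text` on `nil`:

cells   = property_page.css("div.cols").css("td.r_align")
headers = property_page.css("div.cols").css("th")

if cells[4] && headers[4]
  @result_hash[headers[4].text] = cells[4].text
else
  warn "Expected fifth column missing on this page; skipping"
end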

engineersmnky
  • +1 this is almost certainly the literal issue with the code. The OP does not make it clear if the data is simply misshapen for that page or missing entirely due to a temporarily unavailable page. If it is more than a different data structure for one page, please refer to https://stackoverflow.com/a/46083671/525478 – Brad Werth Sep 06 '17 at 21:04
  • @BradWerth I actually do a count of the elements and do a loop around the count so that I don't end up with the issue you had raised. Thanks. – Josh Sep 07 '17 at 14:52
  • @Josh I wonder if you meant to add this comment to my answer? I'm not sure I'm understanding you... Judging by the code you posted, it looks like you are just clicking every link on the page, with no delay. Anyway, I feel like I've spent an inordinate amount of time on this question, and really have no more to provide. Good luck getting it all sorted out! – Brad Werth Sep 07 '17 at 16:02