
I'm starting to use the Mechanize gem for Ruby, and I wonder: is there any way a web server can detect and block activity from a Mechanize agent?

If so, what code or steps would a server use to block Mechanize from scraping or visiting a site?

Thanks for all the fish

2 Answers


There are a number of ways they can detect that an automated process is hitting their site:

  • They can check the User-Agent string.
  • They can look at what you request. A browser fetches all the images and CSS referenced in an HTML page; Mechanize, by default, does not.
  • A human pauses to read a page and understand what it says. Code doesn't: unless it's been programmed to pause, it runs at full speed, so requests follow one another quickly.

These don't necessarily point to Mechanize running, but are fingerprints of code scraping a site.
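The first fingerprint above, the User-Agent string, is the easiest for a server to check. Here is a minimal, framework-free sketch of what such a check could look like; the pattern list and helper name are illustrative, not from any particular server or library:

```ruby
# Sketch: match a request's User-Agent against known automation
# fingerprints. Patterns are illustrative examples only.
BOT_UA_PATTERNS = [
  /mechanize/i, # a stock Mechanize UA advertises itself by name
  /curl/i,
  /wget/i,
].freeze

def automated_user_agent?(user_agent)
  # A missing or empty UA is itself suspicious.
  return true if user_agent.nil? || user_agent.empty?
  BOT_UA_PATTERNS.any? { |pattern| user_agent =~ pattern }
end

# Mechanize's default User-Agent looks roughly like
# "Mechanize/2.x Ruby/3.x (http://github.com/sparklemotion/mechanize/)"
automated_user_agent?("Mechanize/2.9.1 Ruby/3.2.0")                # => true
automated_user_agent?("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") # => false
```

Of course, this only catches clients that leave the default User-Agent in place.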

What can they do about it?

  • Ban that User-Agent.
  • Ban any requests from your IP address, domain, or subnet.
  • Ban any requests from your IP address, domain, or subnet that arrive too quickly.

There are many different ways to go about those things, depending on their server and networking hardware.
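As one concrete illustration of the last item in the list, a server could keep a sliding window of request timestamps per client IP and refuse clients that exceed a threshold. This is a hypothetical sketch, not any real server's implementation; the class name and limits are made up:

```ruby
# Sketch of "ban requests that arrive too quickly": a sliding-window
# rate limiter keyed by client IP. Names and limits are illustrative.
class RateLimiter
  def initialize(max_requests:, window_seconds:)
    @max = max_requests
    @window = window_seconds
    @hits = Hash.new { |h, k| h[k] = [] } # ip => [timestamps]
  end

  # Returns true if this request should be allowed.
  def allow?(ip, now = Time.now.to_f)
    timestamps = @hits[ip]
    timestamps.reject! { |t| t < now - @window } # drop entries outside the window
    return false if timestamps.size >= @max
    timestamps << now
    true
  end
end

limiter = RateLimiter.new(max_requests: 3, window_seconds: 1.0)
4.times.map { limiter.allow?("203.0.113.7", 100.0) }
# => [true, true, true, false] — the fourth request inside the window is refused
```

Real deployments usually do this at the web server or firewall layer (and a polite scraper avoids tripping it by pausing between requests), but the logic is the same.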

This question is pretty off-topic for Stack Overflow and probably should be asked on https://serverfault.com/ or https://webmasters.stackexchange.com/

the Tin Man
  • I'm building a web scraper with Mechanize. Do you know how I can avoid these blocks? I have a website that returns a 403 error (I tried with a new IP, but it's the same). – José Castro Dec 17 '13 at 20:48
  • Your best bet is to read their terms-of-service and check with their support and see if they have provisions for doing what you want to do, either via a certain server or using an API. I won't help you avoid their blocks as that's your problem with them; I have no interest in being involved in any way when I don't know what you're doing with their pages and/or content. – the Tin Man Dec 17 '13 at 21:17
  • It's public information, but they don't have an API :( — I think they blocked me because, while I was learning how to fetch the site, I flooded the server. I added a timer afterwards, but it was too late. Thanks for your answer. – José Castro Dec 17 '13 at 21:26
  • Have you tried the simple thing and called them to ask how you can get back in their good graces? – the Tin Man Dec 17 '13 at 21:28

You can put up a robots.txt file and hope people respect it.
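A minimal robots.txt, served from the site root, that asks every crawler to stay away from the whole site would look like this (it's purely advisory; note that Mechanize only honors it if the scraper opts in, e.g. via `agent.robots = true`):

```
# /robots.txt — advisory only; well-behaved crawlers honor it,
# anything else can simply ignore it.
User-agent: *
Disallow: /
```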

If you start blocking by User-Agent string, they can just pretend to be IE.
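That's because the User-Agent is just a request header the client chooses for itself. A stdlib-only sketch (no Mechanize needed) of claiming a browser identity; the UA string here is an illustrative example:

```ruby
require "net/http"
require "uri"

# The User-Agent is client-supplied: this plain Net::HTTP request
# simply claims to be a desktop browser.
uri = URI("https://example.com/")
request = Net::HTTP::Get.new(uri)
request["User-Agent"] =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/120.0"

request["User-Agent"] # the server sees only this self-reported string
```

In Mechanize itself the equivalent is setting `agent.user_agent_alias` to one of its built-in browser aliases, or assigning `agent.user_agent` directly.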

Thilo