
This question has developed from an answer here.

My question, therefore, is: what steps can one take to ward off standard scrapers?

Saurabh Agarwal

5 Answers


The key word in your question is "standard" scrapers.

There's no way to prevent all possible bots from scraping your site as they could just pose as a regular visitor.

For the 'good' bots, use one or both of robots.txt and a META tag specifying whether a bot may index content and/or follow links:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
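For reference, the robots.txt equivalent of that META tag would look something like the following (a sketch only; adjust the paths for your own site):

```
User-agent: *
Disallow: /
```

Like the META tag, this is purely advisory: compliant crawlers honor it, but nothing enforces it.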

For the 'bad' ones, you'll have to catch them once and block them on a combination of IP, request/referrer headers, etc.
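A minimal sketch of that "catch them once and block them" idea might look like the following. The blocklist entries, function name, and User-Agent patterns are all hypothetical placeholders, not a real service's data:

```python
import re

# Hypothetical blocklist data for illustration only.
BLOCKED_IPS = {"203.0.113.50"}  # IPs you've already caught scraping
BAD_UA_PATTERNS = [r"curl", r"python-requests", r"wget"]  # common tool signatures

def is_blocked(ip, user_agent):
    """Return True if the request matches a blocked IP or a suspect User-Agent."""
    if ip in BLOCKED_IPS:
        return True
    ua = (user_agent or "").lower()
    return any(re.search(pattern, ua) for pattern in BAD_UA_PATTERNS)
```

You would call this early in request handling and return an error status for matches; in practice you would also weigh referrer headers and request timing, as the answer suggests.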

Widor
  • use CAPTCHA
  • analyze traffic (from where and how often your pages are requested)
  • display text mixed with pictures
  • use more client data processing (JavaScript, Java, Flash)
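The "analyze traffic" bullet could be sketched as a per-IP sliding-window counter like the one below. The class name and thresholds are illustrative assumptions, not a specific library's API:

```python
import time
from collections import defaultdict, deque

class RateTracker:
    """Flag an IP that requests pages more than `limit` times per `window` seconds."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        q.append(now)
        return len(q) <= self.limit
```

Requests that come back `False` could be answered with a CAPTCHA challenge rather than an outright block, combining two of the bullets above.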
Mike
  • If you use a captcha, include a time limit of 15-20 seconds for completion, as it takes on average 20-30 seconds to get an answer out of any 'mechanical turk' breaking service or automated OCR script. – Skizz Jun 13 '12 at 18:44

In addition to all the previous mentions of robots.txt, the robots meta tag, and using more JavaScript, one of the surest methods I know of is to put restricted content behind a user login. This will stop all but purpose-built bots. Add a strong captcha (like reCAPTCHA) to the user login and purpose-built bots will be blocked too.

If a site is looking to verify the identity of a client (i.e., including whether it's a bot), that's what user logins are for. :)

User logins can also be disabled if strange activity is detected.
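A toy sketch of the "gate content behind a login, and disable accounts on strange activity" idea, with all names and thresholds invented for illustration:

```python
class Account:
    """Minimal account with a strike counter for suspicious activity."""

    def __init__(self, username):
        self.username = username
        self.disabled = False
        self.strikes = 0

    def record_suspicious_activity(self, max_strikes=3):
        """Disable the account once it accumulates too many strikes."""
        self.strikes += 1
        if self.strikes >= max_strikes:
            self.disabled = True

def fetch_restricted_content(account, logged_in):
    """Anonymous clients and disabled accounts get nothing back."""
    if not logged_in or account.disabled:
        return None
    return "restricted content"
```

"Suspicious activity" here could be fed by a rate limiter or by abnormal crawl patterns; the point is that a login gives you a durable identity to attach strikes to, which an IP address does not.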

David

You can simply place a meta tag like

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

This tells a bot that it may not index your site.

Gooey
  • Only the ones that comply with correct robot rules. There are other bots, such as BLP_bbot (Bloomberg), that ignore robot rules and crawl anyway. – kolin Jun 12 '12 at 11:55
  • Quick question, does that mean it is the bot's responsibility to check for this particular meta info (I am concerned about non-trusted bots)? I may be wrong but doesn't this approach still leave the site vulnerable to all kinds of scraping from the html source itself? – Saurabh Agarwal Jun 12 '12 at 11:55
  • True, if a bot isn't playing by the rules, it will ignore this tag and just continue scraping. – Gooey Jun 12 '12 at 11:57

If you can do server-side processing of requests, you can analyze the user agent string and return a 403 if you detect a scraper. This would not be foolproof: an unscrupulous scraper could use a standard browser user agent to fool your code, false positives would deny your site to real users, and you may end up denying search engines access to your pages.

But, if you can identify 'standard scrapers', this would be another tool to control access to scrapers which do not respect the robots tag.
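The user-agent check described above might be sketched like this; the pattern list and function name are illustrative assumptions, and a real deployment would maintain a far more careful list to avoid blocking legitimate search engines:

```python
import re

# Hypothetical signatures of common scraping frameworks and HTTP libraries.
SCRAPER_UA = re.compile(r"(scrapy|libwww|httpunit|java/)", re.IGNORECASE)

def status_for(user_agent):
    """Return 403 for User-Agent strings that look like standard scrapers, else 200."""
    if user_agent and SCRAPER_UA.search(user_agent):
        return 403
    return 200
```

As the answer notes, this only filters tools that announce themselves honestly; a scraper sending a browser User-Agent sails straight through.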

Ray