1

As we know, robots.txt helps us prevent web crawlers/robots from indexing certain webpages/sections. But this method has certain disadvantages: 1. web crawlers might not respect the robots.txt file; 2. you are exposing the folders you want to protect to everybody.

Is there another way of blocking crawlers from the folders you want to protect? Keep in mind that those folders might still need to be accessible from the browser (like /admin).

machineaddict
  • 3,216
  • 8
  • 37
  • 61

1 Answer

2

Check the User-Agent header on requests and issue a 403 if the header contains the name of a robot. This will block all of the honest robots but not the dishonest ones. But then again, if the robot was really honest, it would obey robots.txt.
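For example, here is a minimal sketch of that check as request middleware in a Flask app (Flask and the keyword list below are my own assumptions for illustration, not part of the original answer):

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Illustrative substrings that commonly appear in crawler User-Agent strings.
BOT_KEYWORDS = ("bot", "crawler", "spider", "slurp")

@app.before_request
def block_robots():
    # Only guard the protected area (e.g. /admin) so normal pages stay crawlable.
    if request.path.startswith("/admin"):
        user_agent = request.headers.get("User-Agent", "").lower()
        if any(keyword in user_agent for keyword in BOT_KEYWORDS):
            abort(403)  # Honest robots identify themselves and get refused here.

@app.route("/admin")
def admin():
    return "Admin area"
```

A regular browser's User-Agent won't match any of these keywords, so /admin stays reachable for humans, while a self-identifying crawler gets a 403 before the view even runs.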

Dan D.
  • 73,243
  • 15
  • 104
  • 123
  • I thought about making a whitelist of user agents, but it's messy and some new user agents might get blocked. Any other solution? – machineaddict May 03 '12 at 15:57