1

As we know, robots.txt helps us prevent web crawlers/robots from indexing certain webpages/sections. But this method has certain disadvantages: 1. web crawlers might not respect the robots.txt file; 2. you are exposing the folders you want to protect to everybody.

Is there another way of blocking crawlers from the folders you want to protect? Keep in mind that those folders might still need to be accessible from the browser (like /admin).

machineaddict
  • 3,216
  • 8
  • 37
  • 61

1 Answer

2

Check the User-Agent header on requests and issue a 403 if the header contains the name of a robot. This will block all of the honest robots but not the dishonest ones. But then again, if the robot was really honest, it would obey robots.txt.
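For example, here is a minimal sketch of that check as request middleware in a Flask app (Flask and the keyword list below are my own assumptions for illustration, not part of the original answer):

```python
from flask import Flask, request, abort

app = Flask(__name__)

# Illustrative substrings that commonly appear in crawler User-Agent strings.
BOT_KEYWORDS = ("bot", "crawler", "spider", "slurp")

@app.before_request
def block_robots():
    # Only guard the protected area (e.g. /admin) so normal pages stay crawlable.
    if request.path.startswith("/admin"):
        user_agent = request.headers.get("User-Agent", "").lower()
        if any(keyword in user_agent for keyword in BOT_KEYWORDS):
            abort(403)  # Honest robots identify themselves and get refused here.

@app.route("/admin")
def admin():
    return "Admin area"
```

A regular browser's User-Agent won't match any of these keywords, so /admin stays reachable for humans, while a self-identifying crawler gets a 403 before the view even runs.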

Dan D.
  • 73,243
  • 15
  • 104
  • 123
  • I thought about making a whitelist of user agents, but it's messy and some new user agents might get blocked. Any other solution? – machineaddict May 03 '12 at 15:57