How to keep HTTrack Crawlers away from my website through robots.txt?

Question

I'm maintaining the website http://www.totalworkflow.co.uk and not sure if the HTTrack follow the instructions given in robots.txt file. If there is any answer that we can keep the HTTrack away from the website please suggest me implement with or else just tell the robot name so I could be able to block this crap from crawling my website. If this is not possible by robots.txt, please recommend if any other way to keep this robots away from the website?

You are right there is no necessity for spam crawlers to follow the guidelines given in the robots.txt file. I know that the robots.txt is only for genuine search engines only. However, the application HTTrack may look genuine if the developers hard code this application not to skip the robots.txt guidelines if provided. If this option is provided then the application would be really useful for the purpose intended. OK lets come to my issue, actually what I would like to find the solution is to keeps the HTTRack crawlers away without hard code anything on the web server. I try to solve this issue at the webmasters level first. However, your idea is great to consider in the future. Thank you

score 1 · Answer 1 · edited May 23 '17 at 11:43

It should obey robots.txt, but robots.txt is a thing that you don't have to obey (and actually a pretty good thing to find what you don't want other people to see for spam bots) so what's the guarantee that (even if it obeys robots now) some time in the future there won't be an option to ignore all robots.txt and metatags? I think a better way is to configure your server-side application to detect and block user agents. There is a chance that the user agent string is hardcoded somewhere in the crawler's source code and the user won't be able to change it to stop you from blocking that crawler. All you have to do is write a server script to spit out user agent information (or check server logs) and then create blocking rules according to this information. Alternatively, you can just google a list of known "bad agents". To block user agents on a server that supports HTACCESS, have a look at this thread for one way of doing it:

Block by useragent or empty referer

In HTTrack, the user agent can be hand-picked or hand-modified and under Options, you can opt-in to ignore robots.txt. Bottom-line is, like you said, you cannot prevent a crawler from crawling your site, unless you want to ban IP ranges or use other methods to actively refuse connections to (robots.txt is voluntarily). — Abel, Dec 16 '12 at 00:06

How to keep HTTrack Crawlers away from my website through robots.txt?

1 Answers1