
I'm configuring my robots.txt file and can't quite work out which directories I should block from robots. I've read some information on the internet, but there's still a gap between what I want to know and what I've found so far. So it would be nice if you could help me by answering some questions:

  • What should I block from robots in robots.txt? It's not that simple. For example, I've got a PHP file, INDEX, in the root (with almost all the content) and a directory containing the engine, called ADMIN. This directory holds lots of directories and files, some of which are data that INDEX in the root folder uses. The point is: if I block the ADMIN directory from robots, will they still get all the data in INDEX that is taken from the ADMIN directory?

  • As before, the INDEX PHP file contains a PHP script that automatically generates links to the next pages (a limited number, of course; it depends on the amount of data in the ADMIN directory). Do robots index these as normal links, along with all the data behind them?

  • If I want to block the ADMIN directory and all the files in it from robots, is it enough to write this?

    User-agent: *
    Disallow: /ADMIN/
    
unor
dotzzy

1 Answer


Bots don’t care about your internal server-side system (well, they can’t see it to begin with).

They visit your website just like a human visitor: by following links (from your own site, from external sites, from your sitemap, etc.), and some might also "guess" URLs.

So what matters are your URLs.

If you have a URL that you don’t want bots to visit ("crawl"), disallow it in your robots.txt.

This robots.txt

    # hosted at http://example.com/

    User-agent: *
    Disallow: /ADMIN/

would disallow crawling of URLs like the following:

  • http://example.com/ADMIN/
  • http://example.com/ADMIN/index.html
  • http://example.com/ADMIN/CMS/foo
  • http://example.com/ADMIN/images/foo.png

But bots would still be allowed to crawl the following URLs, because the rule is a case-sensitive prefix match on the URL path:

  • http://example.com/ADMIN
  • http://example.com/admin/
  • http://example.com/foo/ADMIN/
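
If you want to check such rules yourself, one option is Python's standard-library `urllib.robotparser`, which applies the same kind of prefix matching. This is only a minimal sketch: the helper name `crawl_permissions` is made up for illustration, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from above, assumed to be hosted at http://example.com/
ROBOTS_TXT = """\
User-agent: *
Disallow: /ADMIN/
"""

# The example URLs from the two lists above
URLS = [
    "http://example.com/ADMIN/",
    "http://example.com/ADMIN/index.html",
    "http://example.com/ADMIN/CMS/foo",
    "http://example.com/ADMIN/images/foo.png",
    "http://example.com/ADMIN",
    "http://example.com/admin/",
    "http://example.com/foo/ADMIN/",
]


def crawl_permissions(robots_txt, urls, user_agent="*"):
    """Hypothetical helper: map each URL to True if crawling is allowed."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    for url, allowed in crawl_permissions(ROBOTS_TXT, URLS).items():
        print("allowed   " if allowed else "disallowed", url)
```

The first four URLs are reported as disallowed and the last three as allowed, since only paths that start with exactly `/ADMIN/` match the rule.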
unor
  • OK, thanks a lot. But there are lots of hack bots around, and I'm afraid of them. Some of them don't obey the rules in robots.txt. My questions are: 1) How do I block bots that disobey the rules in robots.txt? 2) How do I prevent such hacker bots from indexing files that robots.txt forbids (not specifically server-side code files)? – dotzzy Mar 13 '15 at 14:00
  • @dotzzy: Yes, only polite bots follow your robots.txt. For other bots, you’d have to block them on the server-side (e.g., via `.htaccess` if you are using Apache, and/or via PHP). The hard part is how to *detect* them. -- Ideally you’d harden your site: don’t publish content you don’t want to get indexed (e.g., put it behind some kind of login), and make sure that your application is secure. – unor Mar 13 '15 at 14:26
  • OK, so if I put a deny/allow rule on some directories/files that allows access only from my IP, these bots wouldn't be able to access them, right? Or are there tricks they could use to get them scanned anyway? – dotzzy Mar 13 '15 at 14:41
  • @dotzzy: Well, I guess this depends on your implementation. For example, your PHP application *could* be exploitable to deliver any file from your server. But in principle, yes, if your app is secure, I guess this would suffice. -- You might be interested in our sister site, [security.se]. – unor Mar 13 '15 at 14:45
  • Come on, don't scare me with exploits :D I think the `.htaccess` rule `Order Deny,Allow` / `Deny from all` should do the trick then. Thanks a lot! :) – dotzzy Mar 13 '15 at 14:59