I want to stop search engines from crawling my whole website.

I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or find it useful.

So I want to add another layer of security (In Theory) to try and prevent unauthorized access by totally removing access to it by all search engine bots/crawlers. Having Google index our site to make it searchable is pointless from the business perspective and just adds another way for a hacker to find the website in the first place to try and hack it.

I know in the robots.txt you can tell search engines not to crawl certain directories.

Is it possible to tell bots not to crawl the whole site without having to list all the directories not to crawl?

Is this best done with robots.txt or is it better done by .htaccess or other?

Eric Leschinski
Iain Simpson
  • Your website is reachable by black-hat hackers even if no search engine indexes it. Black-hat hackers are not doing Google searches to find you; they have their own botnets that crawl the web and ignore `robots.txt`. You are also harming the internet by making it harder for employees to find your website via a Google search. Google doesn't like it when you do this, and you aren't making your website more secure; plus, Google helps you by bringing in customers. It's like the TSA taking your 1-inch blade: they aren't making things safer, and they are hassling everybody. – Eric Leschinski Mar 22 '13 at 15:41

4 Answers

Using robots.txt to keep a site out of search engine indexes has one minor and little-known problem: if anyone ever links to your site from any page indexed by Google (which would have to happen for Google to find your site anyway, robots.txt or not), Google may still index the link and show it as part of their search results, even if you don't allow them to fetch the page the link points to.

If this might be a problem for you, the solution is to not use robots.txt, but instead to include a robots meta tag with the value noindex,nofollow on every page on your site. You can even do this in a .htaccess file using mod_headers and the X-Robots-Tag HTTP header:

Header set X-Robots-Tag noindex,nofollow
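If the same .htaccess file might end up on a server where mod_headers is not enabled, an unknown directive would cause a 500 error. You can wrap the directive in an IfModule check so it degrades silently instead (the directive itself is unchanged):

```apache
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex,nofollow"
</IfModule>
```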

This directive will add the header X-Robots-Tag: noindex,nofollow to every page it applies to, including non-HTML pages like images. Of course, you may want to include the corresponding HTML meta tag too, just in case (it's an older standard, and so presumably more widely supported):

<meta name="robots" content="noindex,nofollow" />

Note that if you do this, Googlebot will still try to crawl any links it finds to your site, since it needs to fetch the page before it sees the header / meta tag. Of course, some might well consider this a feature instead of a bug, since it lets you look in your access logs to see if Google has found any links to your site.
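To illustrate that last point: Googlebot's requests appear in your server's access log under its user-agent string, so grepping the log shows you which of your URLs Google has discovered. The snippet below runs against a two-line made-up sample log; on a real server you would point grep at your actual log file instead (commonly /var/log/apache2/access.log, but the path varies by distribution).

```shell
# Two hypothetical combined-log-format entries, for illustration only.
cat > /tmp/sample_access.log <<'EOF'
66.249.66.1 - - [22/Mar/2013:15:41:00 +0000] "GET /members/report HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.0.0.5 - - [22/Mar/2013:15:42:00 +0000] "GET /members/home HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
EOF

# Print the paths Googlebot has requested -- each one is a link to
# your site that Google found somewhere.
grep -i 'googlebot' /tmp/sample_access.log | awk '{print $7}'
# -> /members/report
```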

In any case, whatever you do, keep in mind that it's hard to keep a "secret" site secret very long. As time passes, the probability that one of your users will accidentally leak a link to the site approaches 100%, and if there's any reason to assume that someone would be interested in finding the site, you should assume that they will. Thus, make sure you also put proper access controls on your site, keep the software up to date and run regular security checks on it.

Ilmari Karonen
    This is akin to increasing security at your home by putting a sign on the doors/windows that says: "Burglars, don't look at this house". The Burglar says: "lol". – Eric Leschinski Mar 22 '13 at 15:50
    This isn't really about 'security', it's about sensible search results. I wrote an early wiki, and as soon as Google found it, all the page histories, diffs against previous versions, and 'edit' pages were indexed – not friendly places for users to land. Removing the 'edit' and 'history' pages from Google's index doesn't make them more secure, but it helps keep that junk out of Google and helps users arrive in the right place. – Galax Feb 05 '16 at 14:31

This is best handled with a robots.txt file, though it only works for bots that respect the file.

To block the whole site add this to robots.txt in the root directory of your site:

User-agent: *
Disallow: /

To limit access to your site for everyone else, .htaccess is better, but you would need to define access rules, by IP address for example.

Below are the .htaccess rules to restrict everyone except your people from your company IP:

Order deny,allow
Deny from all
# Enter your company's IP address here
Allow from 255.1.1.1
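Note that the Order/Allow/Deny directives are Apache 2.2 syntax. On Apache 2.4 and later (with mod_authz_core), the equivalent rule – again using 255.1.1.1 as a stand-in for your company's address – is:

```apache
# Apache 2.4+ syntax; replace with your company's IP or CIDR range
Require ip 255.1.1.1
```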
Ulrich Palha
  • Thanks for that, the robots.txt info is really helpful. I would love to allow just the company IP range, but the app is going to be used by reps on the road, so their IPs can change all the time; otherwise I would certainly do that. Thanks :-) – Iain Simpson Feb 01 '12 at 20:44
  • Is there a way to block bad bots too, e.g. by identifying them as bots rather than users and blocking them, as there is no reason anything other than a human should be accessing the website? – Iain Simpson Feb 01 '12 at 20:49
  • @IainSimpson You could try to deny `bots` based on the user agent, but it would be easy to spoof, and it's very likely that bad bots would not identify themselves as bots to begin with... – Ulrich Palha Feb 01 '12 at 21:07

If security is your concern, and locking things down to known IP addresses isn't viable, you should look into requiring your users to authenticate in some way to access your site.

That would mean that anyone (Google, a bot, a person who stumbled upon a link) who isn't authenticated wouldn't be able to access your pages.

You could bake it into your website itself, or use HTTP Basic Authentication.

https://www.httpwatch.com/httpgallery/authentication/
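A minimal .htaccess sketch for HTTP Basic Authentication on Apache looks like the following. The /etc/apache2/.htpasswd path and realm name are placeholders; keep the password file outside the web root:

```apache
AuthType Basic
AuthName "Members only"
# Placeholder path -- point this at a real htpasswd file outside the web root
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Create the credentials file with the htpasswd utility, e.g. `htpasswd -c /etc/apache2/.htpasswd someuser` (the `-c` flag creates the file the first time).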

ChrisW

In addition to the answers already provided, you can stop search engines from crawling/indexing a specific page of your website via robots.txt. Below is an example:

User-agent: *
Disallow: /example-page/ 

The above example is especially handy when you have dynamic pages. Otherwise, you may want to add the following HTML meta tag to the specific pages you want disallowed from search engines:

<meta name="robots" content="noindex, nofollow" />
Harrison O