Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.
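As a minimal sketch of programmatic generation (the helper name and the paths are hypothetical, not from any particular framework):

```python
def build_robots_txt(disallow_paths, sitemap_url=None):
    """Assemble robots.txt content from caller-supplied paths and sitemap URL."""
    lines = ["User-agent: *"]
    # One Disallow line per path the caller wants excluded.
    lines += [f"Disallow: {path}" for path in disallow_paths]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"

print(build_robots_txt(["/admin/", "/tmp/"], "http://www.example.com/sitemap.xml"))
```

A framework route or controller would then serve this string as `text/plain` at `/robots.txt`.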

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
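This check can be reproduced with Python's standard-library urllib.robotparser, here parsing the example rules above directly rather than fetching them over HTTP:

```python
from urllib.robotparser import RobotFileParser

# The example rules shown above, as a list of lines.
rules = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# Every path is disallowed for every robot, so this prints False.
print(parser.can_fetch("*", "http://www.example.com/welcome.html"))
```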

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
28
votes
11 answers

Meta tag vs robots.txt

Is it better to use meta tags* or the robots.txt file for informing spiders/crawlers to include or exclude a page? Are there any issues in using both the meta tags and the robots.txt? *E.g.: <meta name="robots" content="index, follow">
keruilin
27
votes
4 answers

How to add `nofollow, noindex` all pages in robots.txt?

I want to add nofollow and noindex to my site whilst it's being built. The client has requested I use these rules. I am aware of But I only have access to the robots.txt file. Does anyone know the…
MeltingDog
26
votes
4 answers

Stopping index of Github pages

I have a GitHub page from my repository, username.github.io. However, I do not want Google to crawl my website and absolutely do not want it to show up on search results. Will just using robots.txt in GitHub Pages work? I know there are…
user2961712
25
votes
5 answers

Robots.txt: allow only major SE

Is there a way to configure the robots.txt so that the site accepts visits ONLY from Google, Yahoo! and MSN spiders?
vyger
24
votes
2 answers

robots.txt file for different domains of same site

I have an ASP.NET MVC 4 web application that can be accessed from multiple different domains. The site is fully localized based on the domain in the request (similar in concept to this question). I want to include a robots.txt file and I want to…
amateur
24
votes
3 answers

Serving sitemap.xml and robots.txt with Spring MVC

What is the best way to serve sitemap.xml and robots.txt with Spring MVC? I want to serve these files through a controller in the cleanest way.
michal.kreuzman
23
votes
5 answers

How to set up a robot.txt which only allows the default page of a site

Say I have a site on http://example.com. I would really like to allow bots to see the home page, but any other page needs to be blocked, as it is pointless to spider them. In other words, http://example.com & http://example.com/ should be allowed, but…
Boaz
23
votes
3 answers

Does robots.txt apply to subdomains?

Let's say I have a test folder (test.domain.com) and I don't want the search engines to crawl in it, do I need to have a robots.txt in the test folder or can I just place a robots.txt in the root, then just disallow the test folder?
Pa3k.m
23
votes
6 answers

How do I configure nginx to redirect to a URL for robots.txt & sitemap.xml

I am running nginx 0.6.32 as a proxy front-end for couchdb. I have my robots.txt in the database, reachable as http://www.example.com/prod/_design/mydesign/robots.txt. I also have my sitemap.xml which is dynamically generated, on a similar url. I…
timbo
22
votes
3 answers

Robots.txt Allow sub folder but not the parent

Can anybody please explain the correct robots.txt command for the following scenario. I would like to allow access to: /directory/subdirectory/.. But I would also like to restrict access to /directory/ notwithstanding the above exception.
QFDev
21
votes
3 answers

Angular2 + webpack do not deploy robots.txt

I am creating a web site with Angular2@2.1.2. I am using Webpack with default settings (as a dependency). Here is my package.json "dependencies": { "@angular/common": "2.1.2", "@angular/compiler": "2.1.2", "@angular/core": "2.1.2", "@angular/forms":…
Guymage
20
votes
2 answers

Ruby on Rails robots.txt folders

I'm about to launch a Ruby on Rails application and as the last task, I want to set the robots.txt file. I couldn't find information about how the paths should be written properly for a Rails application. Is the starting path always the root path…
Linus
20
votes
1 answer

Robots.txt - What is the proper format for a Crawl Delay for multiple user agents?

Below is a sample robots.txt file to Allow multiple user agents with multiple crawl delays for each user agent. The Crawl-delay values are for illustration purposes and will be different in a real robots.txt file. I have searched all over the web…
Sammy
19
votes
5 answers

How to prevent staging from being indexed in search engines

I would like my staging websites to not be indexed by search engines (Google first of all). I have heard WordPress is good at doing this, but I would like to be technology agnostic. Is robots.txt enough? We would like to keep anonymous…
toutpt
18
votes
4 answers

How to stop search engines from crawling the whole website?

I want to stop search engines from crawling my whole website. I have a web application for members of a company to use. This is hosted on a web server so that the employees of the company can access it. No one else (the public) would need it or…
Iain Simpson