Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed at the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to leave alone, as well as other information such as the location of a Sitemap. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
219 votes · 3 answers

Can a relative sitemap url be used in a robots.txt?

In robots.txt can I write the following relative URL for the sitemap file? sitemap: /sitemap.ashx Or do I have to use the complete (absolute) URL for the sitemap file, like: sitemap: http://subdomain.domain.com/sitemap.ashx Why I wonder: I own a…
Easyrider
  • 3,199
  • 5
  • 22
  • 32
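A note on the question above: the sitemaps.org protocol describes the Sitemap directive as taking the full URL of the sitemap file, so the absolute form from the question is the safe choice. A minimal sketch reusing the asker's host:

# Sitemap lines are independent of any User-agent group and can appear anywhere in the file.
Sitemap: http://subdomain.domain.com/sitemap.ashx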
161 votes · 5 answers

How to configure robots.txt to allow everything?

My robots.txt in Google Webmaster Tools shows the following values: User-agent: * Allow: / What does it mean? I don't have enough knowledge about it, so looking for your help. I want to allow all robots to crawl my website, is this the right…
Raajpoot
  • 1,611
  • 2
  • 10
  • 3
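For the question above, a sketch of the two usual ways to let every compliant robot crawl everything: an empty Disallow (the original standard) or Allow: / (an extension most major crawlers understand):

User-agent: *
Disallow:

# or, equivalently, for crawlers that support the Allow extension:
User-agent: *
Allow: /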
101 votes · 10 answers

Static files in Flask - robot.txt, sitemap.xml (mod_wsgi)

Is there any clever solution to store static files in Flask's application root directory. robots.txt and sitemap.xml are expected to be found in /, so my idea was to create routes for them: @app.route('/sitemap.xml', methods=['GET']) def sitemap(): …
biesiad
  • 2,258
  • 4
  • 19
  • 16
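One commonly used sketch for the Flask question above, assuming robots.txt and sitemap.xml are stored in the app's static/ folder (the file locations are an assumption, not something the question states):

from flask import Flask, request, send_from_directory

app = Flask(__name__)

@app.route('/robots.txt')
@app.route('/sitemap.xml')
def static_from_root():
    # Serve the requested file from the static folder, stripping the leading slash.
    return send_from_directory(app.static_folder, request.path[1:])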
99 votes · 3 answers

Ignore URLs in robot.txt with specific parameters?

I would like Google to ignore URLs like this: http://www.mydomain.example/new-printers?dir=asc&order=price&p=3 In other words, all the URLs that have the parameters dir, order and price should be ignored. How do I do so with robots.txt?
Luis Valencia
  • 32,619
  • 93
  • 286
  • 506
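A hedged sketch for the question above: Googlebot understands * wildcards in Disallow rules (an extension beyond the original standard), so rules like these would block URLs carrying the dir, order, or p parameters from the sample URL. Such patterns match anywhere in the query string and may over-match:

User-agent: Googlebot
# * matches any run of characters in Google's extended syntax.
Disallow: /*?*dir=
Disallow: /*?*order=
Disallow: /*?*p=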
83 votes · 9 answers

What is the smartest way to handle robots.txt in Express?

I'm currently working on an application built with Express (Node.js) and I want to know what is the smartest way to handle different robots.txt for different environments (development, production). This is what I have right now but I'm not convinced…
Vinch
  • 1,551
  • 3
  • 13
  • 15
78 votes · 5 answers

How to stop Google indexing my Github repository

I use Github to store the text of one of my web sites, but the problem is Google indexing the text in Github as well. So the same text will show up both on my site and on Github. e.g. this search The top hit is my site. The second hit is the Github…
szabgab
  • 6,202
  • 11
  • 50
  • 64
76 votes · 9 answers

Stop Google from indexing

Is there a way to stop Google from indexing a site?
Developer
  • 17,809
  • 26
  • 66
  • 92
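For the question above, the robots.txt part of the usual answer is a blanket Disallow for Googlebot (or for all agents). Note that this only stops crawling; getting already-indexed URLs removed generally also needs a noindex meta tag or header on the pages themselves:

User-agent: Googlebot
Disallow: /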
59 votes · 4 answers

robots.txt to disallow all pages except one? Do they override and cascade?

I want one page of my site to be crawled and no others. Also, if it's any different than the answer above, I would also like to know the syntax for disallowing everything but the root (index) of the website is. # robots.txt for…
nouveau
  • 1,162
  • 1
  • 8
  • 14
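A sketch for the question above, using the Allow extension that Google and Bing honor (the page path is a placeholder). The most specific matching rule wins, so the single page stays crawlable while everything else is blocked:

User-agent: *
Disallow: /
Allow: /the-one-page.html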
56 votes · 2 answers

robots.txt allow root only, disallow everything else?

I can't seem to get this to work but it seems really basic. I want the domain root to be crawled http://www.example.com But nothing else to be crawled and all subdirectories are dynamic http://www.example.com/* I tried User-agent: * Allow:…
cotopaxi
  • 897
  • 1
  • 11
  • 19
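A common sketch for the question above relies on the $ end-of-URL anchor, which Google and Bing support but which is not part of the original standard, so only the bare root stays crawlable:

User-agent: *
Allow: /$
Disallow: /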
55 votes · 1 answer

robots.txt and .htaccess syntax highlight

Is there a way to colorcode/highlight robots.txt and .htaccess syntax? E.g. with a SublimeText2 plug-in. I found this, but can't figure out how to install it: https://github.com/shellderp/sublime-robot-plugin
Geo
  • 12,666
  • 4
  • 40
  • 55
51 votes · 5 answers

Multiple Sitemap: entries in robots.txt?

I have been searching around using Google but I can't find an answer to this question. A robots.txt file can contain the following line: Sitemap: http://www.mysite.com/sitemapindex.xml but is it possible to specify multiple sitemap index files in…
user306942
  • 815
  • 2
  • 8
  • 6
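For the question above: the sitemaps.org protocol allows several Sitemap lines in a single robots.txt, so a sketch with hypothetical file names on the asker's host would be:

Sitemap: http://www.mysite.com/sitemap-pages.xml
Sitemap: http://www.mysite.com/sitemap-posts.xml
# Alternatively, point a single Sitemap line at a sitemap index file that lists the others.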
44 votes · 6 answers

What is the use of the hackers.txt file?

First No I am not asking you to teach me hacking, I am just curious about this file and its content. My journey When I dived into the new HTML5 Boilerplate I came accross the humans.txt. I googled for it and I came at this site…
Ron van der Heijden
  • 14,803
  • 7
  • 58
  • 82
37 votes · 3 answers

How do I disallow specific page from robots.txt

I am creating two pages on my site that are very similar but serve different purposes. One is to thank users for leaving a comment and the other is to encourage users to subscribe. I don't want the duplicate content but I do want the pages to be…
Daniel
  • 6,758
  • 6
  • 31
  • 29
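A sketch for the question above with placeholder paths; keep in mind that Disallow matches by prefix, so a rule also covers anything nested under the listed path:

User-agent: *
Disallow: /thank-you-for-commenting
Disallow: /please-subscribe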
35 votes · 10 answers

Ethics of robots.txt

I have a serious question. Is it ever ethical to ignore the presence of a robots.txt file on a website? These are some of the considerations I've got in mind: If someone puts a web site up they're expecting some visits. Granted, web crawlers are…
Onorio Catenacci
  • 14,928
  • 14
  • 81
  • 132
34 votes · 2 answers

django serving robots.txt efficiently

Here is my current method of serving robots.txt url(r'^robots\.txt/$', TemplateView.as_view(template_name='robots.txt', content_type='text/plain')), I don't think that this is the best way. I think it…
Lucas Ou-Yang
  • 5,505
  • 13
  • 43
  • 62
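One widely used alternative sketch for the Django question above: return the file's contents directly from a plain view instead of rendering a template, which keeps the template engine out of the request entirely (the URL pattern and rules shown are placeholders):

from django.http import HttpResponse
from django.urls import path
from django.views.decorators.http import require_GET

@require_GET
def robots_txt(request):
    # Build the response body in code; no template lookup or rendering involved.
    lines = [
        "User-agent: *",
        "Disallow: /admin/",
    ]
    return HttpResponse("\n".join(lines), content_type="text/plain")

urlpatterns = [
    path("robots.txt", robots_txt),
]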