Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed at the root of a website to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which not to crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are more appropriate on the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
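As a minimal sketch of how a compliant crawler performs the check described above, Python's standard urllib.robotparser module does the same fetch-then-ask dance (the URLs are the example.com placeholders from this page):

from urllib import robotparser

# Fetch and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a given URL;
# with "Disallow: /" under "User-agent: *" this prints False
print(rp.can_fetch("*", "http://www.example.com/welcome.html"))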

1426 questions
11
votes
2 answers

How to disable robots.txt when you launch scrapy shell?

I use the Scrapy shell without problems on several websites, but I run into problems when robots.txt does not allow access to a site. How can I make Scrapy ignore robots.txt entirely? Thanks in advance. I'm not talking…
DARDAR SAAD
  • 392
  • 1
  • 3
  • 17
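For reference, Scrapy's robots.txt compliance is governed by its ROBOTSTXT_OBEY setting; a minimal sketch of the two usual ways to switch it off:

# settings.py: disable robots.txt handling project-wide
ROBOTSTXT_OBEY = False

# or override it for a single shell session from the command line:
# scrapy shell -s ROBOTSTXT_OBEY=False "http://www.example.com/"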
11
votes
7 answers

robots.txt; What encoding?

I am about to create a robots.txt file. I am using Notepad. How should I save the file? UTF-8, ANSI, or what? Also, should it be a capital R? And in the file I am specifying a sitemap location; should this be with a capital S? User-agent: * …
user188962
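For what it's worth: Google's documentation and the current spec (RFC 9309) expect the file to be UTF-8 encoded and served as all-lowercase robots.txt, while directive names such as Sitemap are case-insensitive, so the capital S is convention rather than requirement. A minimal file of the kind the question describes (the sitemap URL is a placeholder):

User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml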
10
votes
2 answers

Listing both sitemaps and sitemap index files in robots.txt?

My site consists of 3 main sections: Reviews, Forum, and Blog. I have plugins for the forum and blog that automatically generate sitemaps for them. The forum plugin generates a sitemap INDEX file pointing to multiple indexes, and the blog plugin…
Chris
  • 1,273
  • 5
  • 19
  • 33
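Multiple Sitemap lines may appear in a single robots.txt, and each may point at either a plain sitemap or a sitemap index file, so mixing the two is fine. A sketch with placeholder URLs:

Sitemap: http://www.example.com/forum-sitemap-index.xml
Sitemap: http://www.example.com/blog-sitemap.xml
Sitemap: http://www.example.com/reviews-sitemap.xml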
10
votes
2 answers

Robots.txt priority question

If I have these lines in robots.txt: Disallow /folder/ Allow /folder/filename.php Will filename.php be allowed then? In which order does Google prioritize the lines? And what will happen here, for example?: Allow / Disallow / I am mainly referring to…
user188962
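As background: the original exclusion standard has no Allow directive at all, and crawlers that do support it differ in evaluation order. Google documents most-specific-match-wins (the longest matching path takes precedence), so with colons added the quoted pair would leave the file crawlable for Googlebot:

User-agent: *
Disallow: /folder/
Allow: /folder/filename.php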
10
votes
3 answers

Sitemap for a site with a large number of dynamic subdomains

I'm running a site which allows users to create subdomains. I'd like to submit these user subdomains to search engines via sitemaps. However, according to the sitemaps protocol (and Google Webmaster Tools), a single sitemap can include URLs from a…
bartekb
  • 203
  • 6
  • 14
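One commonly cited workaround is the cross-submission mechanism from sitemaps.org: each subdomain's robots.txt may reference a sitemap hosted elsewhere, and that sitemap is then treated as authorized for the subdomain that referenced it. A sketch with placeholder hostnames:

# robots.txt served on user1.example.com
Sitemap: http://sitemaps.example.com/user1-sitemap.xml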
10
votes
2 answers

How to block search engines from indexing all URLs beginning with origin.domainname.com

I have www.domainname.com and origin.domainname.com pointing to the same codebase. Is there a way I can prevent all URLs under origin.domainname.com from getting indexed? Is there some rule in robots.txt to do it? Both the URLs are pointing to…
Loveleen Kaur
  • 993
  • 4
  • 16
  • 36
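Because robots.txt is fetched per hostname, a single file cannot distinguish the two domains; the usual approach is to serve a different robots.txt on each one. A sketch of the blocking variant, using the hostname from the question:

# robots.txt returned only for origin.domainname.com
User-agent: *
Disallow: /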
10
votes
2 answers

Should I use different case-spellings for case-insensitive directories in robots.txt?

Unfortunately, I’ve got case-insensitive servers that cannot be replaced in the short term. Some directories need to be excluded from crawling, so I have to Disallow them in my robots.txt. Let’s take /Img/ as an example. If I keep it all lower…
dakab
  • 5,379
  • 9
  • 43
  • 67
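Path matching in robots.txt is case-sensitive from the crawler's point of view, so on a case-insensitive server the defensive answer is to list every spelling that might appear in links. A sketch for the /Img/ example:

User-agent: *
Disallow: /Img/
Disallow: /img/
Disallow: /IMG/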
10
votes
5 answers

Is it possible to list multiple user-agents in one line?

Is it possible in robots.txt to give one instruction to multiple bots without repeatedly having to mention it? Example: User-agent: googlebot yahoobot microsoftbot Disallow: /boringstuff/
elhombre
  • 2,839
  • 7
  • 28
  • 28
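Several bots cannot share one User-agent line, but a single group may stack several User-agent lines that all apply to the rules beneath them. A sketch using the bot names from the question:

User-agent: googlebot
User-agent: yahoobot
User-agent: microsoftbot
Disallow: /boringstuff/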
10
votes
5 answers

How to allow crawlers access to index.php only, using robots.txt?

If I want to allow crawlers to access only index.php, will this work? User-agent: * Disallow: / Allow: /index.php
todd
  • 101
  • 1
  • 1
  • 3
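For crawlers that support Allow with most-specific matching (Google and Bing document this behavior), the combination in the question should work as intended regardless of line order; older first-match crawlers may read it differently:

User-agent: *
Allow: /index.php
Disallow: /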
9
votes
1 answer

How to configure robots.txt file to block all but 2 directories

I don't want search engines to index most of my website. I do, however, want them to index 2 folders (and their children). This is what I set up, but I don't think it works; I still see pages in Google that I wanted to hide: Here's…
jeph perro
  • 6,242
  • 26
  • 90
  • 124
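A minimal sketch of the usual pattern, with placeholder folder names. Note that robots.txt only stops future crawling; pages already in Google's index may need a noindex tag or a removal request before they disappear:

User-agent: *
Allow: /public-folder-1/
Allow: /public-folder-2/
Disallow: /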
9
votes
2 answers

Why does Chrome request a robots.txt?

I have noticed in my logs that Chrome requested a robots.txt alongside everything else I expected it to request. [...] 2017-09-17 15:22:35 - (sanic)[INFO]: Goin' Fast @ http://0.0.0.0:8080 2017-09-17 15:22:35 - (sanic)[INFO]: Starting worker [26704] 2017-09-17…
zython
  • 1,176
  • 4
  • 22
  • 50
9
votes
2 answers

Robots.txt file in MVC.NET 4

I have read an article about hiding some URLs from robots in my ASP.NET MVC project. In the article, the author says we should add an action like this to some of our controllers. In this example he adds the action to the Home Controller: #region…
Behzad Hassani
  • 2,129
  • 4
  • 30
  • 51
9
votes
1 answer

What does the dollar sign mean in robots.txt

I am curious about a website and want to do some web crawling at the /s path. Its robots.txt: User-Agent: * Allow: /$ Allow: /debug/ Allow: /qa/ Allow: /wiki/ Allow: /cgi-bin/loginpage Disallow: / My questions are: What does the dollar-sign mean…
夜一林风
  • 1,247
  • 1
  • 13
  • 24
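In the pattern-matching extensions supported by the major engines (not part of the original standard), $ anchors the pattern at the end of the URL. Allow: /$ therefore matches only the bare homepage, while Disallow: / blocks everything else, so the quoted file permits the root page plus the explicitly allowed sections:

User-agent: *
Allow: /$
Disallow: /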
9
votes
2 answers

Block bingbot from crawling my site

I would like to completely block Bing from crawling my site for now (it's hitting my site at an alarming rate: 500 GB of data a month). I have 1000 subdomains added to Bing Webmaster Tools, so I can't go and set each one's crawl rate. I have tried…
Zoinky
  • 4,083
  • 11
  • 40
  • 78
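At the robots.txt level there are two levers: block bingbot outright, or throttle it with Crawl-delay, which Bing has documented support for. A sketch of both (pick one):

User-agent: bingbot
Disallow: /

# alternatively, slow it down instead of blocking it:
# User-agent: bingbot
# Crawl-delay: 10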
9
votes
1 answer

Nginx: different robots.txt for alternate domain

Summary I have a single web app with an internal and external domain pointing at it, and I want a robots.txt to block all access to the internal domain, but allow all access to the external domain. Problem Detail I have a simple Nginx server block…
Joe J
  • 9,985
  • 16
  • 68
  • 100
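Since each hostname is served its own robots.txt, the usual fix is to make the server return a different body depending on the requested domain; the two file variants themselves are simple (a sketch):

# robots.txt for the internal domain: block everything
User-agent: *
Disallow: /

# robots.txt for the external domain: allow everything
User-agent: *
Disallow: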