Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are more appropriate on the Webmasters Stack Exchange site.
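
The description above notes that modern frameworks often generate robots.txt programmatically rather than serving a static file. A minimal, hypothetical sketch of that idea (Flask, the route, and the IS_PRODUCTION flag are illustrative assumptions, not part of this tag wiki):

from flask import Flask, Response

app = Flask(__name__)

# Hypothetical toggle: block all crawlers on non-production deployments.
IS_PRODUCTION = True

@app.route("/robots.txt")
def robots_txt():
    if IS_PRODUCTION:
        body = "User-agent: *\nDisallow:\n\nSitemap: https://www.example.com/sitemap.xml\n"
    else:
        body = "User-agent: *\nDisallow: /\n"
    return Response(body, mimetype="text/plain")

The robots.php question further down this page uses the same pattern, switching the served rules based on the request port.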

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
0 votes · 1 answer

Robots Text Blocked

header("Content-Type: text/plain; charset=utf-8"); if ($_SERVER['SERVER_PORT'] == 443) { echo "User-agent: *\n" ; echo "Disallow: /\n" ; } else { echo "User-agent: *\n" ; echo "Disallow: \n" ; } What does this code do in robots.php? I found it on…
0 votes · 1 answer

Regarding robots.txt file in web applications

I am using a Tomcat 5.5 server with a web application deployed. I want to block HTTP requests which access the .txt files in my project, for example URLs like https://MyDomain/inside/mytest.txt. I think this can be done using…
user496934
0 votes · 1 answer

Should I use #End Robots# or not in robots.txt?

Should I use #End Robots# or not in robots.txt? I mean, does it help prevent whitespace problems or not? Right now it looks something like this: User-agent: * Disallow: /admin/ Disallow: /account/ Disallow: /access-denied/ #End Robots#
Derfder
0 votes · 1 answer

Disable googlebot fetching www

I have a www redirect in .htaccess, so www.example.com gets a 301 redirect to example.com. But Google still tries to fetch www.example.com as well. Can I disable Googlebot fetching www.example.com, e.g. from Webmaster Tools or robots.txt?
Kristian
0 votes · 1 answer

Using a pixel to ignore tracking of bots?

On my site I need to record some data about visitors, but it's important that I record only human visitors and not any bots, especially bad bots, and do it automatically (captcha and such are not an option). The way it's done now is running a JS that adds…
SimonW
0 votes · 3 answers

What does this command in robots.txt do?

I am wondering what the following code in Robots.txt does. User-agent: * Disallow: /*? Any ideas?
Elvin
0 votes · 2 answers

Prevent Search Spiders from accessing a Rails 3 nested resource with robots.txt

I'm trying to prevent Google, Yahoo et al from hitting my /products/ID/purchase page and am unsure how to do it. I currently block them from hitting sign-in with the following: User-agent: * Disallow: /sign_in Can I do something like the…
Gerard
0 votes · 1 answer

Will search engines honor robots.txt for a separate site that is a virtual directory under another site?

I have a website (Ex: www.examplesite.com), and I am creating another site as a separate, stand-alone site in IIS. This second site's URL will make it look like it's part of my main site: www.examplesite.com/anothersite. This is accomplished by…
maguidhir
0 votes · 1 answer

Block specific file types from google search

I want to block XML files from Googlebot except sitemap.xml. I am using Lazyest Gallery for my WordPress image gallery. Every gallery folder has an XML file containing the details of the images. The problem is, now Google indexes those XML files instead…
Haris
0 votes · 2 answers

Minimum delay between consecutive requests to a server by a web crawler

I have built a multi-threaded web crawler which makes requests to fetch web pages from the corresponding servers. As it is multi-threaded it can overburden a server, and the server can then block the crawler (politeness). I just want to add…
Prannoy Mittal
0 votes · 1 answer

What does "Disallow: /*?" mean in Twitter robots.txt?

Here's the section for every other bot besides Google and co. # Every bot that might possibly read and respect this file. User-agent: * Allow: /search Disallow: /search/users Disallow: /search/*/grid Disallow: /*? Disallow:…
Raz
0 votes · 2 answers

robots.txt disallow /variable_dir_name/directory

I need to disallow /variable_dir_name/directory via robots.txt. I use: Disallow: */directory Noindex: */directory Is that correct?
KenAdamsTR
0 votes · 2 answers

Prevent indexing of sub-directories of parent domain

Say my site children.com (which I want indexed) is also accessible via http://mother.com/children/ (which I don't want indexed). Example hierarchy: /home/username/mother : http://mother.com |_ children : http://www.children.com What would I put…
Wes
0 votes · 1 answer

Robots.txt exclusion pattern

I'm looking to have bingbot ignore all URLs that contain the query string variable at1=, but I'm not clear on the pattern that should be used. Should one of these work? User-agent: bingbot Disallow: /*at1= Disallow: *at1=* To confirm, I could have a…
robjmills
0 votes · 1 answer

Hit rate limitation in nutch

Is it possible to limit the hit rate per IP address in nutch? In other words, can I configure nutch so that it will only hit an IP x number of times per hour, etc.?
Enno Shioji