Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are more appropriate on the Webmasters Stack Exchange site.
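
The description above notes that modern frameworks often generate robots.txt programmatically rather than serving a static file. A minimal, hypothetical sketch of that idea (Flask, the route, and the IS_PRODUCTION flag are illustrative assumptions, not part of this tag wiki):

from flask import Flask, Response

app = Flask(__name__)

# Hypothetical toggle: block all crawlers on non-production deployments.
IS_PRODUCTION = True

@app.route("/robots.txt")
def robots_txt():
    if IS_PRODUCTION:
        body = "User-agent: *\nDisallow:\n\nSitemap: https://www.example.com/sitemap.xml\n"
    else:
        body = "User-agent: *\nDisallow: /\n"
    return Response(body, mimetype="text/plain")

The robots.php question further down this page uses the same pattern, switching the served rules based on the request port.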

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
0 votes · 1 answer

Robots Text Blocked

header("Content-Type: text/plain; charset=utf-8"); if ($_SERVER['SERVER_PORT'] == 443) { echo "User-agent: *\n" ; echo "Disallow: /\n" ; } else { echo "User-agent: *\n" ; echo "Disallow: \n" ; } What does this code do in robots.php? I found it on…
0 votes · 1 answer

Regarding robots.txt file in web applications

I am using a Tomcat 5.5 server with a web application deployed. I want to block HTTP requests which access the .txt files in my project, for example URLs like https://MyDomain/inside/mytest.txt. I think this can be done using…
user496934
0 votes · 1 answer

Should I use #End Robots# or not in robots.txt?

Should I use #End Robots# or not in robots.txt? I mean, does it help prevent whitespace problems or not? Right now it looks something like this: User-agent: * Disallow: /admin/ Disallow: /account/ Disallow: /access-denied/ #End Robots#
Derfder
0 votes · 1 answer

Disable googlebot fetching www

I have a www redirect in .htaccess, so www.example.com gets a 301 redirect to example.com. But Google still tries to fetch www.example.com as well. Can I disable Googlebot fetching www.example.com, e.g. from Webmaster Tools or robots.txt?
Kristian
0 votes · 1 answer

Using a pixel to ignore tracking of bots?

On my site I need to record some data about visitors, but it's important that I record only human visitors and not any bots, especially bad bots, and do it automatically (captcha and such are not an option). The way it's done now is running a JS that adds…
SimonW
0 votes · 3 answers

What does this command in robots.txt do?

I am wondering what the following code in Robots.txt does. User-agent: * Disallow: /*? Any ideas?
Elvin
0 votes · 2 answers

Prevent Search Spiders from accessing a Rails 3 nested resource with robots.txt

I'm trying to prevent Google, Yahoo et al from hitting my /products/ID/purchase page and am unsure how to do it. I currently block them from hitting sign-in with the following: User-agent: * Disallow: /sign_in Can I do something like the…
Gerard
0 votes · 1 answer

Will search engines honor robots.txt for a separate site that is a virtual directory under another site?

I have a website (Ex: www.examplesite.com), and I am creating another site as a separate, stand-alone site in IIS. This second site's URL will make it look like it's part of my main site: www.examplesite.com/anothersite. This is accomplished by…
maguidhir
0 votes · 1 answer

Block specific file types from google search

I want to block XML files from Googlebot except sitemap.xml. I am using Lazyest Gallery for my WordPress image gallery. Every gallery folder has an XML file containing the details of the images. The problem is, now Google indexes those XML files instead…
Haris
0 votes · 2 answers

Minimum delay between consecutive requests to a server by a web crawler

I have built a multi-threaded web crawler which makes requests to fetch web pages from the corresponding servers. As it is multi-threaded it can overburden a server, and the server can then block the crawler (politeness). I just want to add…
Prannoy Mittal
0 votes · 1 answer

What does "Disallow: /*?" mean in Twitter robots.txt?

Here's the section for every other bot besides Google and co. # Every bot that might possibly read and respect this file. User-agent: * Allow: /search Disallow: /search/users Disallow: /search/*/grid Disallow: /*? Disallow:…
Raz
0 votes · 2 answers

robots.txt disallow /variable_dir_name/directory

I need to disallow /variable_dir_name/directory via robots.txt. I use: Disallow: */directory Noindex: */directory Is that correct?
KenAdamsTR
0 votes · 2 answers

Prevent indexing of sub-directories of parent domain

Say my site children.com (which I want indexed) is also accessible via http://mother.com/children/ (which I don't want indexed). Example hierarchy: /home/username/mother : http://mother.com |_ children : http://www.children.com What would I put…
Wes
0 votes · 1 answer

Robots.txt exclusion pattern

I'm looking to have bingbot ignore all URLs that contain the query string variable at1=, but I'm not clear on the pattern that should be used. Should one of these work? User-agent: bingbot Disallow: /*at1= Disallow: *at1=* To confirm, I could have a…
robjmills
0 votes · 1 answer

Hit rate limitation in nutch

Is it possible to limit the hit rate per IP address in nutch? In other words, can I configure nutch so that it will only hit an IP x number of times per hour, etc.?
Enno Shioji