Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a web site domain to give instructions to compliant web robots (such as search engine crawlers) about what pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
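
A more selective file can mix directives: block specific directories for all robots and advertise a Sitemap location. The paths and sitemap URL below are only illustrative:

```
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://www.example.com/sitemap.xml
```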

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt; in particular, malware robots that scan the web for security vulnerabilities and email-address harvesters used by spammers will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
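For checking URLs against these rules programmatically, Python's standard library ships a compliant parser in urllib.robotparser. This is a minimal sketch; the rules and example.com URLs are only illustrative:

```python
# Test URLs against robots.txt rules using the standard-library parser.
from urllib.robotparser import RobotFileParser

# Rules as they would appear in the fetched robots.txt file.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url) applies the matching User-agent section.
print(rp.can_fetch("MyBot", "http://www.example.com/welcome.html"))       # True
print(rp.can_fetch("MyBot", "http://www.example.com/private/data.html"))  # False
```

In practice you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file instead of supplying the lines yourself.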

1426 questions
0
votes
1 answer

How can I exclude certain paths from being crawled/indexed?

I have the following URL structure on my website: user accounts: http://www.mydomain.com/username; user accounts may contain items under: http://www.mydomain.com/username/item/itemId. What do I have to set in my robots.txt so that the user accounts…
Michael
  • 32,527
  • 49
  • 210
  • 370
0
votes
2 answers

How to setup robots.txt on multi-site VPS

So I have a VPS (running Debian) set up to host a number of sites I'm working on, with a structure like…
0
votes
1 answer

Will setting noindex/nofollow on parent pages affect site SEO for child pages?

This is a two-part question. Each parent page links to the first child page. The parent pages will not have any content; they will serve as main menu links, site URL structure, and site hierarchy. My website (WP) structure is as…
Nick Rivers
  • 294
  • 1
  • 6
  • 17
0
votes
2 answers

How to Add robots.txt to a Vaadin 7 Application with CDI-Integration?

How can I add a robots.txt file to a Vaadin application? I found nearly nothing related, but what I found states that there is no support for such a file. I'm using Vaadin 7.1.1 with JBoss 7.1.1 and Vaadin-CDI-Integration. My workaround approach…
aboger
  • 2,214
  • 6
  • 33
  • 47
0
votes
1 answer

How do I only allow crawlers to visit a part of the site?

I've got an ajax rich website which has extensive _escaped_fragment_ portions for Ajax indexing. While all my _escaped_fragment_ urls do 301 redirects to a special module which then outputs the HTML snapshots the crawlers need (i.e.…
Swader
  • 11,387
  • 14
  • 50
  • 84
0
votes
1 answer

Robots.txt http://example.com vs. http://www.example.com

I have a situation where we have two code bases that need to stay intact. Example: http://example.com, and a new site http://www.example.com. The old site (no WWW) supports some legacy code and has the rule: User-agent: * Disallow: / But in the…
g00se0ne
  • 4,560
  • 2
  • 21
  • 14
0
votes
1 answer

Unlist a subdomain or directory according to robotstxt.org

According to robotstxt.org The first answer is a workaround: You could put all the files you don't want robots to visit in a separate sub directory, make that directory un-listable on the web (by configuring your server) How do I configure my…
EGHDK
  • 17,818
  • 45
  • 129
  • 204
0
votes
1 answer

How to deindex a specific category in OpenCart through the robots.txt file

Hello, if I am not wrong, the robots.txt file for OpenCart will be this: User-agent: * Disallow: /*&limit Disallow: /*&sort Disallow: /*?route=checkout/ Disallow: /*?route=account/ Disallow: /*?route=product/search Disallow: /*?route=affiliate/ Allow: / I…
0
votes
1 answer

Made changes to robots.txt but search engines still say description not available

Most of the questions I see are trying to hide the site from being indexed by search engines. For myself, I'm attempting the opposite. For the robots.txt file, I've put the following: # robots.txt User-agent: * Allow: / # End robots.txt…
Nina
  • 1,037
  • 10
  • 19
0
votes
2 answers

Need to block some URL from robots file

I would like to disallow some URLs in the robots file of my website and am having some difficulty. Right now my robots file has the following content: User-agent: * Allow: / Disallow: /cgi-bin/ Sitemap: http://seriesgate.tv/sitemap.xml I do not want…
alikarimi
  • 1
  • 1
0
votes
1 answer

How to prohibit bot access to physical location of robots.txt for multi-site?

If I have the following in my .htaccess: (disallow bots from going to /dir1/dir2) Disallow: /dir1/dir2 And I have in my .htaccess: (when accessing robots.txt, pipe them the data from dir1/dir2/robots.txt) RewriteCond %{HTTP_HOST}…
Lakitu
  • 424
  • 1
  • 4
  • 12
0
votes
1 answer

Allow crawling of only the home page of a sub-directory using robots.txt

I have www.example.com with WordPress and www.example.com/sitetwo with another WordPress. I would like to allow crawling of the entire example.com but only the home page of example.com/sitetwo. What do I have to write in my robots.txt?
michele
  • 26,348
  • 30
  • 111
  • 168
0
votes
2 answers

Grails Files in Root not found

I have a Grails app and want to make a robots.txt and sitemap.xml file. I read that the best way to put them into the application is in the web-app folder. When I run the site locally and test http://mysite/app/robots.txt everything works, but when…
skaz
  • 21,962
  • 20
  • 69
  • 98
0
votes
1 answer

Robots interpreting script tags

Our web application is currently crawled by a multitude of robots. However, some of them seem to try and parse javascript tags and interpret some of it as links, which are called and fill our error log with loads of 404s. On our pages we have…
Thomas
  • 87,414
  • 12
  • 119
  • 157
0
votes
1 answer

Best way to prevent Google from indexing a directory

I've researched many methods on how to prevent Google/other search engines from crawling a specific directory. The two most popular ones I've seen are: Adding it into the robots.txt file: Disallow: /directory/ Adding a meta tag:
user2154729
  • 97
  • 1
  • 9