Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as the location of a Sitemap. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt; in particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
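
The tag description above mentions generating the file programmatically. A minimal PHP sketch of that idea (the APP_ENV variable and the rewrite of /robots.txt to this script are assumptions for illustration, not part of any standard):

<?php
// robots.php: emit robots.txt rules that depend on the environment.
// Assumes the web server rewrites /robots.txt to this script and that
// an APP_ENV environment variable distinguishes production.
header('Content-Type: text/plain');
if (getenv('APP_ENV') === 'production') {
    echo "User-agent: *\n";
    echo "Disallow:\n";      // empty Disallow: allow everything
} else {
    echo "User-agent: *\n";
    echo "Disallow: /\n";    // block staging/dev entirely
}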

1426 questions
0
votes
1 answer

robots.txt remove entire subdomain/directory

I have a subdomain, forums.example.com, which lives in public_html/forums. If I put the following robots.txt in public_html/forums, will it remove all the forums from the index? (I migrated the forums to a different provider and want to remove all the forum pages…
Chris Muench
  • 17,444
  • 70
  • 209
  • 362
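
A hedged sketch for the question above: assuming forums.example.com's document root really is public_html/forums, a robots.txt placed there is served at http://forums.example.com/robots.txt, and the rules below tell compliant crawlers to stop fetching the whole subdomain. Note that blocking crawling does not by itself remove pages that are already indexed; 404/410 responses or Google's URL removal tool handle that part.

User-agent: *
Disallow: /
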
0
votes
1 answer

Avoiding one domain to be indexed by search engines

I have a site that is available through two domains. One domain is one that I got for free with a hosting plan and don't want to promote. However, when I perform search queries, pages from my site on that domain pop up. What techniques are there to…
Koen
  • 3,626
  • 1
  • 34
  • 55
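
A common approach for the question above is a permanent redirect from the unwanted domain to the preferred one, so search engines consolidate on a single host. An Apache mod_rewrite sketch (both hostnames are placeholders):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^unwanted-free-domain\.example$ [NC]
RewriteRule ^(.*)$ http://www.preferred-domain.example/$1 [R=301,L]

Where a redirect is not possible, a rel="canonical" link on each page pointing at the preferred domain is an alternative.
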
0
votes
1 answer

PHP Error[2]: fopen(http://www1.macys.com/robots.txt)

I am trying to download the contents of the robots.txt file. My original problem link: PHP file_exists() for URL/robots.txt returns false. This is line 22: $f = fopen($file, 'r'); and I get this error: PHP Error[2]:…
Ionut Flavius Pogacian
  • 4,750
  • 14
  • 58
  • 100
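
One common cause of the failure in the question above is that some hosts reject HTTP requests that carry no User-Agent header, which makes fopen() and file_exists() on the URL appear to fail. A hedged PHP sketch that sends one explicitly (the agent string is illustrative):

<?php
// Fetch robots.txt with an explicit User-Agent header.
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: MyFetcher/1.0\r\n",
    ),
));
$body = @file_get_contents('http://www1.macys.com/robots.txt', false, $context);
if ($body === false) {
    echo "Fetch failed\n";
} else {
    echo $body;
}
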
0
votes
1 answer

Make PHP Web Crawler to Respect the robots.txt file of any website

I have developed a web crawler and now I want it to respect the robots.txt files of the websites that I am crawling. I see that this is the robots.txt file structure: User-agent: * Disallow: /~joe/junk.html Disallow: /~joe/foo.html Disallow:…
Ionut Flavius Pogacian
  • 4,750
  • 14
  • 58
  • 100
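
A minimal, hedged sketch for the crawler question above (the function names are illustrative): collect the Disallow prefixes in the "User-agent: *" group, then test each candidate path against them. A production parser also needs per-agent groups, Allow rules, and wildcard handling.

<?php
// Return the Disallow path prefixes from the "User-agent: *" group.
function disallowed_prefixes($robotsTxt) {
    $prefixes = array();
    $applies = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $path = trim(substr($line, 9));
            if ($path !== '') {
                $prefixes[] = $path;  // empty Disallow: means allow all
            }
        }
    }
    return $prefixes;
}

// A path is allowed unless it starts with a disallowed prefix.
function is_allowed($path, array $prefixes) {
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
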
0
votes
3 answers

Google Webmaster is not Accepting my Sitemap

After setting up my Google Webmaster account and verifying my website, I failed to add my sitemap to it. It was issuing the following error. I tried to do the following: I removed the robots.txt, and it still didn't work. I tried to verify my sitemap…
CompilingCyborg
  • 4,760
  • 13
  • 44
  • 61
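
For errors like the one above, two things worth checking are that robots.txt does not block the sitemap URL itself and that the sitemap is advertised from robots.txt. An illustrative robots.txt (the URL is a placeholder):

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
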
0
votes
2 answers

Selectively indexing subdomains

I am working on a web application which allows users to create their own web apps in turn. For each new web app created by my application I assign a new subdomain, e.g. subdomain1.xyzdomain.com, subdomain2.xyzdomain.com, etc. All these web apps are stored…
lalit
  • 3,283
  • 2
  • 19
  • 26
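
Because crawlers fetch /robots.txt separately for each host, one hedged approach for the question above is to rewrite every subdomain's robots.txt to a single script that decides per host. A sketch (the rewrite rule, hostnames, and file name are all illustrative):

RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]

<?php
// robots.php: allow the main domain, block user-created subdomains.
header('Content-Type: text/plain');
$host = $_SERVER['HTTP_HOST'];
if ($host === 'xyzdomain.com' || $host === 'www.xyzdomain.com') {
    echo "User-agent: *\nDisallow:\n";
} else {
    echo "User-agent: *\nDisallow: /\n";
}
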
0
votes
1 answer

Wordpress function to update or create robots.txt

I am making a plugin for WordPress with a function to update the robots.txt file, or to create it if it does not exist yet. So far I have this function: function roots_robots() { echo "Disallow: /cgi-bin\n"; echo "Disallow: /wp-admin\n"; echo…
user1482757
  • 11
  • 1
  • 1
  • 4
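
When no physical robots.txt exists, WordPress serves a virtual one and passes it through the core robots_txt filter, so a plugin can extend it without touching the filesystem at all. A hedged sketch (the function name is illustrative):

<?php
// Append rules to WordPress's virtual robots.txt output.
add_filter('robots_txt', 'myplugin_extend_robots', 10, 2);
function myplugin_extend_robots($output, $public) {
    $output .= "Disallow: /cgi-bin\n";
    $output .= "Disallow: /wp-admin\n";
    return $output;
}

Note that the filter only fires when there is no physical robots.txt file in the site root.
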
0
votes
2 answers

How to stop robots crawling pagination using robots.txt?

I have various paginated listings on my site and I want to stop Google and other search engines from crawling the index pages of my paginations. Example of a crawled page: http://www.mydomain.com/explore/recently-updated/index/12. How can I, using robots.txt, deny…
harrynortham
  • 299
  • 2
  • 4
  • 13
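
A hedged sketch for the pagination question above, using the path from the example URL. The first rule is plain prefix matching; the wildcard variant covers every section at once but relies on the * extension supported by Googlebot and other major crawlers, not on the original standard:

User-agent: *
Disallow: /explore/recently-updated/index/
Disallow: /*/index/
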
0
votes
1 answer

No Robots robots.txt Location

A bit confused with robots.txt. Say I wanted to block robots on a site on a Linux-based Apache server at /var/www/mySite. I would place robots.txt in that directory (alongside index.php), containing this: User-agent: * Disallow:…
Adam Waite
  • 19,175
  • 22
  • 126
  • 148
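
One detail worth stressing for the question above: crawlers never look for robots.txt inside arbitrary directories; they fetch it from the root URL of the host, so a file in /var/www/mySite only works because that directory is the DocumentRoot. Also mind the trailing slash; blocking the whole site requires:

User-agent: *
Disallow: /

whereas a bare "Disallow:" with nothing after the colon allows everything.
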
0
votes
2 answers

updating the robots

OK, so I want to add this: User-agent: * Disallow: / to the robots.txt in all the environments other than production... any idea on the best way to do this? Should I remove it from the public folder and create a route/view? I am using Rails 3.0.14…
Matt Elhotiby
  • 43,028
  • 85
  • 218
  • 321
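
A hedged, framework-agnostic approach for the Rails question above is to keep two static files and copy the appropriate one into public/ at deploy time (the file names are illustrative):

cp config/robots.disallow.txt   public/robots.txt   # development/staging
cp config/robots.production.txt public/robots.txt   # production
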
0
votes
2 answers

robots.txt exclude urls which contain specific part

I have URLs like this: www.wunderwedding.com/weddingvenues/share-weddingvenue/175/beachclub-all-good www.wunderwedding.com/weddingvenues/share-weddingvenue/2567/castle-rock Since these URLs no longer exist, I want to disallow Googlebot via…
Adam
  • 6,041
  • 36
  • 120
  • 208
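
A hedged sketch for the question above, using the path from the example URLs; robots.txt rules are prefix matches, so one rule covers every URL under the retired section. Since the pages no longer exist, serving 404 or 410 for them is arguably better: robots.txt stops crawling but does not remove URLs that are already in the index.

User-agent: Googlebot
Disallow: /weddingvenues/share-weddingvenue/
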
0
votes
0 answers

htaccess, how to allow access to a file by only search engines or bots

I have a sitemap file called links.txt, and I want only search engines/bots to access this file. How can I do that via the .htaccess file?
Hamza
  • 1,087
  • 4
  • 21
  • 47
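
A hedged .htaccess sketch for the question above, matching on the User-Agent header (Apache 2.2-style directives; the bot list is illustrative). Since User-Agent strings can be forged, this is a courtesy filter, not real access control:

<Files "links.txt">
    SetEnvIfNoCase User-Agent "googlebot|bingbot|slurp" allowed_bot
    Order Deny,Allow
    Deny from all
    Allow from env=allowed_bot
</Files>
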
0
votes
1 answer

Block google from indexing some pages from site

I have a problem with lots of 404 errors on one site. I figured out that these errors are happening because Google is trying to find pages that no longer exist. Now I need to tell Google not to index those pages again. I found some solutions on the…
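
A hedged sketch for the question above: if the pages are permanently gone, letting them return 404 or, better, 410 Gone makes Google drop them from the index over time, while a robots.txt rule (the path below is illustrative) only stops them from being re-crawled:

User-agent: *
Disallow: /removed-section/
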
0
votes
1 answer

Kohana: What is the best way to serve robots.txt?

In our web app we're redirecting all 404s to a pretty error page, but for robots.txt we need to serve a default page (or return 404), else Google won't index us. Should I be adding a route to bootstrap.php specifically for…
David Parks
  • 30,789
  • 47
  • 185
  • 328
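
A hedged sketch for the Kohana question above, assuming Kohana 3's routing API (class and route names are illustrative). Registering the route before the catch-all keeps robots.txt away from the pretty 404 handler:

// In bootstrap.php, before the default route:
Route::set('robots', 'robots.txt')
    ->defaults(array(
        'controller' => 'robots',
        'action'     => 'index',
    ));

// classes/controller/robots.php
class Controller_Robots extends Controller {
    public function action_index() {
        $this->response->headers('Content-Type', 'text/plain');
        $this->response->body("User-agent: *\nDisallow:\n");
    }
}
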
0
votes
1 answer

Avoid or block all load balanced sites from being crawled

We have an Umbraco site in a load-balanced environment, and we need to make sure only the actual URL gets crawled and not the different production URLs. We only want example.com to be indexed, while load balancers at production1.example.com and…
Ingen Speciell
  • 107
  • 2
  • 13
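
A hedged sketch for the load-balancing question above. Since Umbraco typically runs on IIS, one option is an IIS URL Rewrite rule in web.config that serves a blocking robots file whenever the request arrives on an internal production hostname (the hostnames follow the question; the file name is illustrative):

<rule name="Internal hosts get blocking robots" stopProcessing="true">
    <match url="^robots\.txt$" />
    <conditions>
        <add input="{HTTP_HOST}" pattern="^production[0-9]+\.example\.com$" />
    </conditions>
    <action type="Rewrite" url="robots-disallow.txt" />
</rule>

where robots-disallow.txt contains "User-agent: *" followed by "Disallow: /".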