Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as the location of a Sitemap. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt; in particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
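
The tag description above mentions generating the file programmatically. A minimal PHP sketch of that idea (the APP_ENV variable and the rewrite of /robots.txt to this script are assumptions for illustration, not part of any standard):

<?php
// robots.php: emit robots.txt rules that depend on the environment.
// Assumes the web server rewrites /robots.txt to this script and that
// an APP_ENV environment variable distinguishes production.
header('Content-Type: text/plain');
if (getenv('APP_ENV') === 'production') {
    echo "User-agent: *\n";
    echo "Disallow:\n";      // empty Disallow: allow everything
} else {
    echo "User-agent: *\n";
    echo "Disallow: /\n";    // block staging/dev entirely
}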

1426 questions
0
votes
1 answer

robots.txt remove entire subdomain/directory

I have a subdomain, forums.example.com, which lives in public_html/forums. If I put the following robots.txt in public_html/forums, will it remove all the forums from the index? (I migrated the forums to a different provider and want to remove all the forum pages…
Chris Muench
  • 17,444
  • 70
  • 209
  • 362
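
A hedged sketch for the question above: assuming forums.example.com's document root really is public_html/forums, a robots.txt placed there is served at http://forums.example.com/robots.txt, and the rules below tell compliant crawlers to stop fetching the whole subdomain. Note that blocking crawling does not by itself remove pages that are already indexed; 404/410 responses or Google's URL removal tool handle that part.

User-agent: *
Disallow: /
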
0
votes
1 answer

Avoiding one domain to be indexed by search engines

I have a site that is available through two domains. One domain is one that I got for free with a hosting plan and don't want to promote. However, when I perform search queries, pages from my site on that domain pop up. What techniques are there to…
Koen
  • 3,626
  • 1
  • 34
  • 55
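
A common approach for the question above is a permanent redirect from the unwanted domain to the preferred one, so search engines consolidate on a single host. An Apache mod_rewrite sketch (both hostnames are placeholders):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^unwanted-free-domain\.example$ [NC]
RewriteRule ^(.*)$ http://www.preferred-domain.example/$1 [R=301,L]

Where a redirect is not possible, a rel="canonical" link on each page pointing at the preferred domain is an alternative.
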
0
votes
1 answer

PHP Error[2]: fopen(http://www1.macys.com/robots.txt)

I am trying to download the contents of the robots.txt file. My original problem link: PHP file_exists() for URL/robots.txt returns false. This is line 22: $f = fopen($file, 'r'); and I get this error: PHP Error[2]:…
Ionut Flavius Pogacian
  • 4,750
  • 14
  • 58
  • 100
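
One common cause of the failure in the question above is that some hosts reject HTTP requests that carry no User-Agent header, which makes fopen() and file_exists() on the URL appear to fail. A hedged PHP sketch that sends one explicitly (the agent string is illustrative):

<?php
// Fetch robots.txt with an explicit User-Agent header.
$context = stream_context_create(array(
    'http' => array(
        'method' => 'GET',
        'header' => "User-Agent: MyFetcher/1.0\r\n",
    ),
));
$body = @file_get_contents('http://www1.macys.com/robots.txt', false, $context);
if ($body === false) {
    echo "Fetch failed\n";
} else {
    echo $body;
}
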
0
votes
1 answer

Make PHP Web Crawler to Respect the robots.txt file of any website

I have developed a web crawler and now I want it to respect the robots.txt files of the websites that I am crawling. I see that this is the robots.txt file structure: User-agent: * Disallow: /~joe/junk.html Disallow: /~joe/foo.html Disallow:…
Ionut Flavius Pogacian
  • 4,750
  • 14
  • 58
  • 100
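
A minimal, hedged sketch for the crawler question above (the function names are illustrative): collect the Disallow prefixes in the "User-agent: *" group, then test each candidate path against them. A production parser also needs per-agent groups, Allow rules, and wildcard handling.

<?php
// Return the Disallow path prefixes from the "User-agent: *" group.
function disallowed_prefixes($robotsTxt) {
    $prefixes = array();
    $applies = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $applies = (trim(substr($line, 11)) === '*');
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $path = trim(substr($line, 9));
            if ($path !== '') {
                $prefixes[] = $path;  // empty Disallow: means allow all
            }
        }
    }
    return $prefixes;
}

// A path is allowed unless it starts with a disallowed prefix.
function is_allowed($path, array $prefixes) {
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
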
0
votes
3 answers

Google Webmaster is not Accepting my Sitemap

After setting up my Google Webmaster account and verifying my website, I failed to add my sitemap to it. It was issuing the following error. I tried to do the following: I removed the robots.txt, and it still didn't work. I tried to verify my sitemap…
CompilingCyborg
  • 4,760
  • 13
  • 44
  • 61
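
For errors like the one above, two things worth checking are that robots.txt does not block the sitemap URL itself and that the sitemap is advertised from robots.txt. An illustrative robots.txt (the URL is a placeholder):

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
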
0
votes
2 answers

Selectively indexing subdomains

I am working on a web application which allows users to create their own web apps in turn. For each new web app created by my application I assign a new subdomain, e.g. subdomain1.xyzdomain.com, subdomain2.xyzdomain.com, etc. All these web apps are stored…
lalit
  • 3,283
  • 2
  • 19
  • 26
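
Because crawlers fetch /robots.txt separately for each host, one hedged approach for the question above is to rewrite every subdomain's robots.txt to a single script that decides per host. A sketch (the rewrite rule, hostnames, and file name are all illustrative):

RewriteEngine On
RewriteRule ^robots\.txt$ robots.php [L]

<?php
// robots.php: allow the main domain, block user-created subdomains.
header('Content-Type: text/plain');
$host = $_SERVER['HTTP_HOST'];
if ($host === 'xyzdomain.com' || $host === 'www.xyzdomain.com') {
    echo "User-agent: *\nDisallow:\n";
} else {
    echo "User-agent: *\nDisallow: /\n";
}
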
0
votes
1 answer

Wordpress function to update or create robots.txt

I am making a plugin for WordPress with a function to update the robots.txt file, or to create it if it does not exist yet. So far I have this function: function roots_robots() { echo "Disallow: /cgi-bin\n"; echo "Disallow: /wp-admin\n"; echo…
user1482757
  • 11
  • 1
  • 1
  • 4
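
When no physical robots.txt exists, WordPress serves a virtual one and passes it through the core robots_txt filter, so a plugin can extend it without touching the filesystem at all. A hedged sketch (the function name is illustrative):

<?php
// Append rules to WordPress's virtual robots.txt output.
add_filter('robots_txt', 'myplugin_extend_robots', 10, 2);
function myplugin_extend_robots($output, $public) {
    $output .= "Disallow: /cgi-bin\n";
    $output .= "Disallow: /wp-admin\n";
    return $output;
}

Note that the filter only fires when there is no physical robots.txt file in the site root.
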
0
votes
2 answers

How to stop robots crawling pagination using robots.txt?

I have various paginated listings on my site and I want to stop Google and other search engines from crawling the index pages of my paginations. Example of a crawled page: http://www.mydomain.com/explore/recently-updated/index/12. How can I, using robots.txt, deny…
harrynortham
  • 299
  • 2
  • 4
  • 13
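
A hedged sketch for the pagination question above, using the path from the example URL. The first rule is plain prefix matching; the wildcard variant covers every section at once but relies on the * extension supported by Googlebot and other major crawlers, not on the original standard:

User-agent: *
Disallow: /explore/recently-updated/index/
Disallow: /*/index/
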
0
votes
1 answer

No Robots robots.txt Location

A bit confused with robots.txt. Say I wanted to block robots on a site on a Linux-based Apache server at /var/www/mySite. I would place robots.txt in that directory (alongside index.php), containing this: User-agent: * Disallow:…
Adam Waite
  • 19,175
  • 22
  • 126
  • 148
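
One detail worth stressing for the question above: crawlers never look for robots.txt inside arbitrary directories; they fetch it from the root URL of the host, so a file in /var/www/mySite only works because that directory is the DocumentRoot. Also mind the trailing slash; blocking the whole site requires:

User-agent: *
Disallow: /

whereas a bare "Disallow:" with nothing after the colon allows everything.
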
0
votes
2 answers

updating the robots

OK, so I want to add this: User-agent: * Disallow: / to the robots.txt in all the environments other than production... any idea on the best way to do this? Should I remove it from the public folder and create a route/view? I am using Rails 3.0.14…
Matt Elhotiby
  • 43,028
  • 85
  • 218
  • 321
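
A hedged, framework-agnostic approach for the Rails question above is to keep two static files and copy the appropriate one into public/ at deploy time (the file names are illustrative):

cp config/robots.disallow.txt   public/robots.txt   # development/staging
cp config/robots.production.txt public/robots.txt   # production
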
0
votes
2 answers

robots.txt exclude urls which contain specific part

I have URLs like this: www.wunderwedding.com/weddingvenues/share-weddingvenue/175/beachclub-all-good www.wunderwedding.com/weddingvenues/share-weddingvenue/2567/castle-rock Since these URLs no longer exist, I want to disallow Googlebot via…
Adam
  • 6,041
  • 36
  • 120
  • 208
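
A hedged sketch for the question above, using the path from the example URLs; robots.txt rules are prefix matches, so one rule covers every URL under the retired section. Since the pages no longer exist, serving 404 or 410 for them is arguably better: robots.txt stops crawling but does not remove URLs that are already in the index.

User-agent: Googlebot
Disallow: /weddingvenues/share-weddingvenue/
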
0
votes
0 answers

htaccess, how to allow access to a file by only search engines or bots

I have a sitemap file called links.txt, and I want only search engines/bots to access this file. How can I do that via the .htaccess file?
Hamza
  • 1,087
  • 4
  • 21
  • 47
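
A hedged .htaccess sketch for the question above, matching on the User-Agent header (Apache 2.2-style directives; the bot list is illustrative). Since User-Agent strings can be forged, this is a courtesy filter, not real access control:

<Files "links.txt">
    SetEnvIfNoCase User-Agent "googlebot|bingbot|slurp" allowed_bot
    Order Deny,Allow
    Deny from all
    Allow from env=allowed_bot
</Files>
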
0
votes
1 answer

Block google from indexing some pages from site

I have a problem with lots of 404 errors on one site. I figured out that these errors are happening because Google is trying to find pages that no longer exist. Now I need to tell Google not to index those pages again. I found some solutions on the…
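
A hedged sketch for the question above: if the pages are permanently gone, letting them return 404 or, better, 410 Gone makes Google drop them from the index over time, while a robots.txt rule (the path below is illustrative) only stops them from being re-crawled:

User-agent: *
Disallow: /removed-section/
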
0
votes
1 answer

Kohana: What is the best way to serve robots.txt?

In our web app we're redirecting all 404s to a pretty error page, but for robots.txt we need to serve a default page (or return 404), else Google won't index us. Should I be adding a route to bootstrap.php specifically for…
David Parks
  • 30,789
  • 47
  • 185
  • 328
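
A hedged sketch for the Kohana question above, assuming Kohana 3's routing API (class and route names are illustrative). Registering the route before the catch-all keeps robots.txt away from the pretty 404 handler:

// In bootstrap.php, before the default route:
Route::set('robots', 'robots.txt')
    ->defaults(array(
        'controller' => 'robots',
        'action'     => 'index',
    ));

// classes/controller/robots.php
class Controller_Robots extends Controller {
    public function action_index() {
        $this->response->headers('Content-Type', 'text/plain');
        $this->response->body("User-agent: *\nDisallow:\n");
    }
}
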
0
votes
1 answer

Avoid or block all load balanced sites from being crawled

We have an Umbraco site in a load-balanced environment, and we need to make sure only the actual URL gets crawled and not the different production URLs. We only want example.com to be indexed, while load balancers at production1.example.com and…
Ingen Speciell
  • 107
  • 2
  • 13
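
A hedged sketch for the load-balancing question above. Since Umbraco typically runs on IIS, one option is an IIS URL Rewrite rule in web.config that serves a blocking robots file whenever the request arrives on an internal production hostname (the hostnames follow the question; the file name is illustrative):

<rule name="Internal hosts get blocking robots" stopProcessing="true">
    <match url="^robots\.txt$" />
    <conditions>
        <add input="{HTTP_HOST}" pattern="^production[0-9]+\.example\.com$" />
    </conditions>
    <action type="Rewrite" url="robots-disallow.txt" />
</rule>

where robots-disallow.txt contains "User-agent: *" followed by "Disallow: /".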