Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a web site domain to give instructions to compliant web robots (such as search engine crawlers) about what pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.
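As a rough illustration of generating the file programmatically, here is a minimal Python sketch. The function name `build_robots_txt` and its parameters are hypothetical and do not come from any particular framework; a real application would typically serve the returned string at `/robots.txt` with a `text/plain` content type.

```python
def build_robots_txt(disallowed_paths, sitemap_url=None, user_agent="*"):
    """Return robots.txt content as a string (hypothetical helper, not a library API)."""
    lines = [f"User-agent: {user_agent}"]
    if disallowed_paths:
        # One Disallow line per path prefix that compliant robots should skip.
        lines += [f"Disallow: {path}" for path in disallowed_paths]
    else:
        # An empty Disallow value means nothing is disallowed.
        lines.append("Disallow:")
    if sitemap_url:
        # Optional Sitemap location, separated from the rule group by a blank line.
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The sitemap URL below is an assumed example value.
    print(build_robots_txt(["/private/"], "http://www.example.com/sitemap.xml"))
```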

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
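The check a compliant robot performs can be sketched with Python's standard-library `urllib.robotparser`. This is only an illustration: the rules are passed in as lines rather than fetched over HTTP, and the user-agent name `ExampleBot` and the helper `may_fetch` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the example above, supplied directly instead of downloaded.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical crawler name; the wildcard group applies to it.
    print(may_fetch("ExampleBot", "http://www.example.com/welcome.html"))
```

With `Disallow: /` in the wildcard group, the check returns `False` for every path on the site, which matches the explanation above.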

There are two important considerations when using /robots.txt:

Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
The /robots.txt file is publicly available. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions

2 answers

Bingbot ignoring robots.txt and attempting to retrieve a trafficbasedsspsitemap.xml

I have an app whose content should not be publicly indexed. I've therefore disallowed access to all crawlers. robots.txt: # Robots shouldn't index a private app. User-agent: * Disallow: / However, Bing has been ignoring this and daily requests a…

bing robots.txt bingbot

asked Apr 03 '13 at 19:16

Brad Koch

3 answers

Will / in robots.txt also apply to the root directory?

In website.com/path/ there is a robots.txt file, that contains the following: User-agent: * Disallow: / I do NOT want it to apply for website.com, but only to the path itself. The question is: does / actually mean ./ or does it refer to the web…

robots.txt

asked Mar 24 '13 at 11:33

bizna

1 answer

Html - PHP - Robots

I am just looking for a bit of advice / feedback. I was thinking about setting up an OpenCart shop behind an HTML site that gets ranked well in Google. The index.html site appears instead of the index.php page by default on the web server (I have…

php opencart robots.txt google-index

asked Mar 19 '13 at 11:00

user1882651

1 answer

Why Google add a notice "this site may be compromised"

This morning, a lot of my websites were tagged "this site may be compromised" by Google in its results. These are sites under my supervision on my own VPS server. I've run a deep scan on it and found nothing unusual. I've looked for suspicious htaccess and…

.htaccess webserver robots.txt

asked Mar 07 '13 at 21:15

Jaune Citron

1 answer

How to block selecting links containing?

What code do I add to robots.txt to allow: ?app=core&module=global&section=sitemap&sitemap=sitemap_core_topics_4.xml.gz But block all other links containing ? Current Code: Disallow: /*? blocks the above link(s) containing the keyword sitemap.

robots.txt invision-power-board

asked Feb 28 '13 at 22:03

OperaManiac

1 answer

Blocking Links From Being Indexed in Opencart

I seem to be receiving a lot of 404 errors from Google Webmaster Tools of late. Is there a way to prevent these sorts of links from being indexed, and where do they come from? I would appreciate it if you could shed some light on this matter. Thanks in…

.htaccess opencart robots.txt

asked Feb 18 '13 at 07:29

bernie

1 answer

Disallow files without extension

I am currently doing this with htaccess RewriteRule ^view$ view.php [L] And I link to /view without the extension and it works fine. But how do I disable robots from indexing /view? in the robots.txt I put Disallow: /view.php Disallow: /view When…

.htaccess robots.txt

asked Feb 16 '13 at 01:20

ramo

3 answers

Nginx block robots.txt file

I am running Nginx 1.1.19 on Ubuntu Server 12.04 and I'm having trouble getting Googlebot to see the robots.txt file. I used the examples in this post, but without success. To test the service, I access the Webmaster Tools, click on "Integrity…

nginx robots.txt

asked Feb 14 '13 at 20:02

hdegenaro

1 answer

How To Use a Wildcard in robots.txt

Is it possible to: User-agent: * Disallow: /apps/abc*/ In a robots.txt file to disallow abc123, abc-xyz, etc.?

robots.txt

asked Feb 07 '13 at 16:29

H. Ferrence

1 answer

Bots throws 500 error in apache access log

In my Apache error log I see an enormous number of the following errors every day. [Tue Jan 15 13:37:39 2013] [error] [client 66.249.78.53] Request exceeded the limit of 10 internal redirects due to probable configuration error. Use…

apache .htaccess bots robots.txt

asked Jan 16 '13 at 14:22

FR STAR

1 answer

robots.txt Disallow: /click What is disallowed?

I would like to scrape a web site. It has the following in its robots.txt file, but I'm not exactly sure what it is they don't want me to do: User-agent: * Disallow: /click There is no click subdirectory. Or they don't want me to access anything…

web-crawler robots.txt

asked Jan 15 '13 at 22:30

user984003

1 answer

How to noindex AJAX-loaded pages without a head tag?

On a client's website there are a series of tutor popups that are located on separate pages. http://launcheducation.com/ The thing is that they do not need to be noindexed, they're just popups on the main page. If you load the links directly they…

ajax wordpress seo robots.txt noindex

asked Jan 10 '13 at 21:54

JaidynReiman

3 answers

Hide a specific folder and it's sub folders and files ?

I want to hide a folder named (beta) in public_html from the search engines, along with all its subfolders and files. Do I have to put the file in the root folder ( / ) and write the content of the robots.txt like the following: User-agent: * Disallow:…

seo robots.txt

asked Jan 10 '13 at 08:59

osos

1 answer

How to block all URLs in robots.txt except for the directory indexes?

For example, I want to block /foo.php, /foo/foo.php and every other similar URL in robots.txt, only leaving /, /foo/, etc. behind. In other words, I want to block everything except the directories. How is this possible?

robots.txt

asked Jan 03 '13 at 00:18

Lucas

2 answers

Disallow pagination from being indexed

I have a Joomla website with over 1000 pages that contain URLs like this: www.mysite.com/example.html?start=10 www.mysite.com/example.html?start=20 www.mysite.com/example.html?limitstart=0 All these URLs are indexed by Google; in Google…

joomla pagination robots.txt

asked Dec 27 '12 at 20:43

BerrKamal
