Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a web site domain to give instructions to compliant web robots (such as search engine crawlers) about what pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
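As a rough illustration of the check a compliant robot performs, the following sketch feeds the two rules above to Python's standard `urllib.robotparser` module. The crawler name "ExampleBot" is a hypothetical placeholder, and the rules are parsed from a string here rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as a crawler would receive them.
rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical crawler name used only for illustration.
url = "http://www.example.com/welcome.html"
allowed = parser.can_fetch("ExampleBot", url)
print(allowed)  # False: "Disallow: /" covers every path on the site
```

A real crawler would typically call `set_url()` and `read()` to download the file before calling `can_fetch()`.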

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.
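Because the file is plain text, generating it programmatically is usually a matter of string formatting. The sketch below is one minimal way to do it, not a framework's API: the function name `build_robots_txt` and its dictionary input format are assumptions made for illustration.

```python
def build_robots_txt(rules, sitemap=None):
    """Render robots.txt text from {user_agent: [disallowed_path, ...]}.

    Hypothetical helper: the name and input format are illustrative only.
    """
    lines = []
    for user_agent, disallowed in rules.items():
        lines.append(f"User-agent: {user_agent}")
        # An empty Disallow value means nothing is blocked for this agent.
        for path in disallowed or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        lines.append("")  # a blank line separates groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip("\n") + "\n"


# Reproduces the "block everything" example shown earlier.
robots_txt = build_robots_txt({"*": ["/"]})
print(robots_txt, end="")
```

In a web framework, the returned string would be served at /robots.txt with a text/plain content type.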

1426 questions
0
votes
2 answers

Bingbot ignoring robots.txt and attempting to retrieve a trafficbasedsspsitemap.xml

I have an app whose content should not be publicly indexed. I've therefore disallowed access to all crawlers. robots.txt: # Robots shouldn't index a private app. User-agent: * Disallow: / However, Bing has been ignoring this and daily requests a…
Brad Koch
  • 19,267
  • 19
  • 110
  • 137
0
votes
3 answers

Will / in robots.txt also apply to the root directory?

In website.com/path/ there is a robots.txt file, that contains the following: User-agent: * Disallow: / I do NOT want it to apply for website.com, but only to the path itself. The question is: does / actually mean ./ or does it refer to the web…
bizna
  • 702
  • 8
  • 23
0
votes
1 answer

Html - PHP - Robots

I am just looking for a bit of advice/feedback. I was thinking about setting up an OpenCart installation behind an HTML site (shop) that gets ranked well in Google. The index.html site appears instead of the index.php page by default on the web server (I have…
0
votes
1 answer

Why does Google add a notice "this site may be compromised"?

This morning, a lot of my websites were tagged "this site may be compromised" by Google in its results. These are sites under my supervision on my own VPS server. I've run a deep scan on it and found nothing unusual. I've looked for suspicious htaccess and…
Jaune Citron
  • 319
  • 3
  • 13
0
votes
1 answer

How to selectively block links containing "?"

What code do I add to Robots.txt to allow: ?app=core&module=global&section=sitemap&sitemap=sitemap_core_topics_4.xml.gz But block all other links containing ? Current Code: Disallow: /*? blocks the above link(s) containing the keyword sitemap.
0
votes
1 answer

Blocking Links From Being Indexed in Opencart

I seem to be receiving a lot of 404 errors from Google Webmaster tools of late. Is there a way to prohibit these sort of links from being indexed? and where do they come from? Appreciate if you could shed some light on this matter. Thanks in…
bernie
  • 501
  • 1
  • 6
  • 12
0
votes
1 answer

Disallow files without extension

I am currently doing this with htaccess RewriteRule ^view$ view.php [L] And I link to /view without the extension and it works fine. But how do I disable robots from indexing /view? in the robots.txt I put Disallow: /view.php Disallow: /view When…
ramo
  • 945
  • 4
  • 12
  • 20
0
votes
3 answers

Nginx block robots.txt file

I am running Nginx 1.1.19 on an Ubuntu 12.04 server and I'm having trouble getting Googlebot to see the robots.txt file. I used the examples from this post, but without success. To test the service, I access the Webmaster Tools, click on "Integrity…
hdegenaro
  • 131
  • 1
  • 4
0
votes
1 answer

How To Use a Wildcard in robots.txt

Is it possible to: User-agent: * Disallow: /apps/abc*/ In a robots.txt file to disallow abc123, abc-xyz, etc.?
H. Ferrence
  • 7,906
  • 31
  • 98
  • 161
0
votes
1 answer

Bots throw 500 errors in Apache access log

In my Apache error log I can see the following error occurring an enormous number of times every day. [Tue Jan 15 13:37:39 2013] [error] [client 66.249.78.53] Request exceeded the limit of 10 internal redirects due to probable configuration error. Use…
FR STAR
  • 662
  • 4
  • 24
  • 50
0
votes
1 answer

robots.txt Disallow: /click What is disallowed?

I would like to scrape a web site. It has the following in its robots.txt file, but I'm not exactly sure what it is they don't want me to do: User-agent: * Disallow: /click There is no click subdirectory. Or they don't want me to access anything…
user984003
  • 28,050
  • 64
  • 189
  • 285
0
votes
1 answer

How to noindex AJAX-loaded pages without a head tag?

On a client's website there are a series of tutor popups that are located on separate pages. http://launcheducation.com/ The thing is that they do not need to be noindexed, they're just popups on the main page. If you load the links directly they…
JaidynReiman
  • 954
  • 3
  • 11
  • 22
0
votes
3 answers

Hide a specific folder and its subfolders and files?

I want to hide a folder named ( beta ) in the public_html from the search engines, along with all its subfolders and files. Do I have to put the file in the root folder ( / ) and write the content of the robots.txt like the following User-agent: * Disallow:…
osos
  • 2,103
  • 5
  • 28
  • 42
0
votes
1 answer

How to block all URLs in robots.txt except for the directory indexes?

For example, I want to block /foo.php, /foo/foo.php and every other similar URL in robots.txt, only leaving /, /foo/, etc. behind. In other words, I want to block everything except the directories. How is this possible?
Lucas
  • 16,930
  • 31
  • 110
  • 182
0
votes
2 answers

Disallow pagination from being indexed

I have a Joomla website with over 1000 pages that contain URLs like this: www.mysite.com/example.html?start=10 www.mysite.com/example.html?start=20 www.mysite.com/example.html?limitstart=0 All these URLs are indexed by Google, in Google…
BerrKamal
  • 89
  • 1
  • 10