Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a web site domain to give instructions to compliant web robots (such as search engine crawlers) about what pages to crawl and not crawl, as well as other information such as a Sitemap location. In modern frameworks it can be useful to programmatically generate the file. General questions about Search Engine Optimization are more appropriate on the Webmasters StackExchange site.
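As a rough illustration of generating the file programmatically, here is a minimal Python sketch. The function name `build_robots_txt`, its parameters, and the example paths and sitemap URL are all hypothetical and not tied to any particular framework; the sketch only assembles the `User-agent`, `Disallow` and `Sitemap` lines described above.

```python
from typing import Mapping, Optional, Sequence


def build_robots_txt(
    disallow_by_agent: Mapping[str, Sequence[str]],
    sitemap_url: Optional[str] = None,
) -> str:
    """Assemble robots.txt text from a mapping of user agent -> disallowed paths.

    Hypothetical helper for illustration only.
    """
    lines = []
    for agent, paths in disallow_by_agent.items():
        lines.append(f"User-agent: {agent}")
        if not paths:
            # An empty Disallow value means nothing is disallowed for this agent.
            lines.append("Disallow:")
        for path in paths:
            lines.append(f"Disallow: {path}")
        lines.append("")  # blank line separates groups
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    # Normalize to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    # Example values are assumptions, not taken from a real site.
    print(build_robots_txt({"*": ["/private/"]}, "http://www.example.com/sitemap.xml"))
```

A web framework would typically return this string from a route serving `/robots.txt` with a `text/plain` content type; how that route is wired up depends on the framework.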

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
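The check a compliant robot performs can be sketched with Python's standard `urllib.robotparser` module. This is a minimal sketch, not how any particular crawler is implemented: the helper name `is_allowed` and the user agent string "ExampleBot" are assumptions, and the file contents are passed in as a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The two-line file from the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper; parses the text locally instead of downloading it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up user agent name.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html"))
```

With the file above, the call reports that the page may not be fetched, because "Disallow: /" covers every path for every user agent. A real crawler would instead point the parser at the live file with `set_url()` and `read()` before calling `can_fetch()`.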

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.

More information can be found at http://www.robotstxt.org/.

1426 questions
0
votes
1 answer

Evidences for automatic browsing-Log file analysis

I'm not quite sure whether this is the suitable forum to post my question. I'm analyzing web server logs in both Apache and IIS log formats. I want to find evidence of automatic browsing (e.g. web robots, spiders, bots, etc.). I used Python…
Nilani Algiriyage
  • 32,876
  • 32
  • 87
  • 121
0
votes
2 answers

Having problems understanding how to block some URLs on robot.txt

The problem is this. I have some URLs on my system that follow this pattern: http://foo-editable.mydomain.com/menu1/option2 http://bar-editable.mydomain.com/menu3/option1 I would like to indicate in the robot.txt file that they should not be…
paddingtonMike
  • 1,441
  • 1
  • 21
  • 37
0
votes
1 answer

Drupal Aegir - Symlinked files directory and multisite robots.txt

I'm using Aegir/Barracuda/Nginx to maintain a multisite setup. My "files" directory is symlinked to a mounted "files" directory. Therefore when I clone a site to be used for dev purposes it uses the same "files" directory. The problem with the…
Meggy
  • 1,491
  • 3
  • 28
  • 63
0
votes
1 answer

Googlebot and Bingbot crawling DNN site

I have a DNN site with over 20,000 pages. Googlebot and Bingbot are consistently crawling my website. When I look at my site log I can see that Google and Bing are crawling my site via the page ID (e.g. www.url.com/Default.aspx?TabID=5000). The…
Cesar
  • 139
  • 2
  • 15
0
votes
1 answer

Can I use robots.txt to send the robots to a specific folder?

So I have a regular website and a blog in the same domain. In the future I plan on buying a domain exclusively for the blog but for now this is the way I'll do it. The blog is in the directory /blog and there are no links from the main site to the…
Pier
  • 10,298
  • 17
  • 67
  • 113
0
votes
2 answers

Blocking URLs that contain numbers in robots.txt

My website allows search engines to index the same page in 2 formats like: www.example.com/page-1271.html www.example.com/page-1271-page-title.html All my site pages are like that. So, how can I block the first format in the robots.txt file? I mean…
hatem tawfik
  • 21
  • 1
  • 1
  • 4
0
votes
2 answers

how to block multiple links in robot.txt with one line?

I have many pages whose links are as follows: http://site.com/school_flower/ http://site.com/school_rose/ http://site.com/school_pink/ etc. I can't block them manually. How could I block these kinds of pages, while I have hundreds of links of above…
0
votes
0 answers

Editing robot.txt files in wp

XML sitemap generator plugin simply put the following string in robot.txt file, if we see so many wp blogs they have lots of tags included in it. also my xml file looks like "sitemap.xml.gz" this, User-agent: * Disallow: /wp-admin/ Disallow:…
Naruto
  • 9,476
  • 37
  • 118
  • 201
0
votes
1 answer

robots.txt codes for exclude several dir in one dir

I want Disallow google images to index my images in these path please let me know that am i right for this code in robots.txt. /images/otherimages/dir1/here are several images /images/otherimages/dir2/here are several images User-agent:…
Kaveh
  • 2,530
  • 7
  • 29
  • 34
0
votes
0 answers

RewriteRule causing redirection

I have IIS 7.5 with ISAPI_Rewrite (Helicon). I'm trying to make the robots.txt of each hosted site the same. For that purpose I have one dummy site (sometestsite.com) which has robots1.txt (which I want to be reused on each other…
Vladimirs
  • 8,232
  • 4
  • 43
  • 79
0
votes
0 answers

UAT site is not searchable(crawable)

We have production and test environments like any other company, and I was thinking of putting a robots.txt into the UAT root folder so that the Google web crawler would not do an unwarranted crawl of the UAT pages. But what I found out was surprising. I do not…
Lost
  • 12,007
  • 32
  • 121
  • 193
0
votes
1 answer

Avoid robots from going into a www.domain.com/thishash when link posted to twitter, facebook

I'm building a service where people get notified (by mail) when someone follows a link with the format www.domain.com/this_is_a_hash. The people that use this service can share this link in different places like Twitter, Tumblr, Facebook and more... The…
Andres
  • 11,439
  • 12
  • 48
  • 87
0
votes
2 answers

blocked links in sitemap

I'm using an online sitemap generator tool which generates links even for pages that are blocked in robots.txt. Do these blocked links affect site ranking? Is there any way to overcome it?
ArK
  • 20,698
  • 67
  • 109
  • 136
0
votes
1 answer

Disallow subdomain url using robots.txt

I would like to ask you a question... I have a domain kiosban.com and store.kiosban.com, and I want to disallow store.kiosban.com/template/* I have this in my store.kiosban.com/robots.txt, but when I look at Google Webmaster Tools... on health…
0
votes
1 answer

How to properly split a site?

Suppose I have a new version of a website: http://www.mywebsite.com and I would like to keep the older site in a sub-directory and treat it separately: http://www.mywebsite.com/old/ My new site has a link to the old one on the main page,…
Maximus
  • 1,441
  • 14
  • 38