Questions tagged [robots.txt]

A convention for telling web crawlers which parts of a website they should not crawl.

If a site owner wishes to give instructions to web robots, they must place a text file called robots.txt in the root of the web site hierarchy (e.g. www.example.com/robots.txt). This text file should contain the instructions in a specific format (see the example below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the web site. If this file doesn't exist, web robots assume that the site owner wishes to give no specific instructions.

A robots.txt file on a website functions as a request that specified robots ignore certain files or directories when crawling the site. This might be, for example, out of a preference for privacy from search engine results, a belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data. Note that robots.txt restricts crawling, not indexing: links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
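A minimal robots.txt in this format might look like the following sketch (the crawler name and paths are purely illustrative):

    # Rules for all crawlers
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    # Stricter rules for one hypothetical crawler
    User-agent: ExampleBot
    Disallow: /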

For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com.
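For illustration, each host answers for its own file (hosts and paths here are hypothetical):

    # Served at https://example.com/robots.txt
    User-agent: *
    Disallow: /admin/

    # Served at https://a.example.com/robots.txt as a separate file;
    # without it, a.example.com carries no restrictions at all
    User-agent: *
    Disallow: /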

Source: Wikipedia

86 questions
0 votes, 0 answers

With an Nginx / Node.js reverse proxy, how does Nginx serve robots.txt when .txt files aren't referenced in the config's location blocks?

In Chrome, when I enter https://www.example.com/robots.txt, my robots.txt file is served and works fine. I'm happy that it works, but I'm not sure why it does. In the config below I thought that my last location block, location /, was a catch-all that…
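A plausible explanation, sketched under assumptions (the root path and upstream below are invented, not taken from the asker's config): an exact-match location for /robots.txt, or a try_files directive that checks the document root first, takes priority over the catch-all, so Nginx serves the file from disk and only proxies everything else to Node.js.

    server {
        listen 80;
        server_name www.example.com;           # hypothetical
        root /var/www/site;                    # hypothetical document root

        # An exact-match location wins over the prefix catch-all
        # below, so robots.txt is read straight from disk
        location = /robots.txt {
            try_files $uri =404;
        }

        # Catch-all: everything else is proxied to the Node.js app
        location / {
            proxy_pass http://127.0.0.1:3000;  # hypothetical upstream
        }
    }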
0 votes, 1 answer

Make Google Apps site publicly accessible while disabling crawlers with robots.txt?

I would like to create a publicly accessible Google Apps site (i.e. users do not need to be authenticated to access the content) while maintaining a crawler and bot exclusion policy with robots.txt. Does anyone know how to do that?
asked by Joannes Vermorel (493)
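Whether Google Apps lets you place a file at the site root is the open question here; if it could be done, the exclusion file itself would simply be (a hedged sketch):

    # Ask all compliant crawlers to stay out of the whole site
    User-agent: *
    Disallow: /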
0 votes, 1 answer

How to prevent search engines from indexing a specific URL

I have a URL which I don't want indexed: http://www.mysite.com/moduleA?param=secretkey So when I search Google for "mysite.com", I don't want the above link to appear in the search results. However, the following URLs are part of public…
asked by Parag (123)
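One hedged approach for this question (robots.txt only blocks crawling, so the URL can still be indexed from external links; a noindex meta tag or X-Robots-Tag header is the surer route):

    User-agent: *
    # Prefix match: blocks /moduleA?param=secretkey and any other
    # query string on /moduleA, but not /moduleA/public-page
    Disallow: /moduleA?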
-1 votes, 2 answers

How to disallow crawling for all subdomains using my main domain's physical robots.txt file

I have multiple physical sub-domains and I don't want to change the robots.txt file of any of those sub-domains. Is there any way to disallow all the sub-domains from my main domain's physical…
asked by Aditya Shah (101)
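robots.txt is fetched per host, so a single file cannot cover other subdomains on its own; a common workaround is to alias every subdomain's /robots.txt to a blocking file at the web-server layer. A hedged Apache mod_rewrite sketch (the domain and file names are assumptions):

    RewriteEngine On
    # Any host other than the main domain gets the blocking file instead
    RewriteCond %{HTTP_HOST} !^(www\.)?example\.com$ [NC]
    RewriteRule ^/?robots\.txt$ /robots-deny.txt [L]

where /robots-deny.txt contains User-agent: * followed by Disallow: /.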
-1 votes, 1 answer

Robots.txt - disallow crawling one directory on subdomain

I have placed my product showcase on a subdomain such as http://demo.domain.com/productname/. The demo version of the product is located at http://demo.domain.com/productname/demo/. I would like to disallow crawling of the demo version; can someone help me?
asked by w3bariak (1)
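A hedged sketch: the file has to live at http://demo.domain.com/robots.txt, and the path in the rule is relative to that host:

    # Served at http://demo.domain.com/robots.txt
    User-agent: *
    Disallow: /productname/demo/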
-1 votes, 1 answer

Block all bots from .ru domains via robots.txt or htaccess?

Is there any way to write rules for robots.txt or .htaccess that will block all bots that come from a .ru domain? Thanks
asked by bob (1)
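robots.txt has no notion of where a client connects from, so this has to happen at the server. A hedged Apache 2.4 sketch using reverse-DNS matching (it needs hostname lookups, trusts rDNS, and misses bots without .ru reverse records, so treat it as illustrative only):

    # e.g. inside a <Directory> block for the site
    <RequireAll>
        Require all granted
        # Deny clients whose reverse DNS resolves under .ru
        Require not host ru
    </RequireAll>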
-1 votes, 1 answer

Disallow: /?q=search/ in robots.txt

Does /?q=search/ mean I can't web-scrape the search pages whose URLs end with =search/? Can I scrape a URL that ends with =0#search?
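A hedged reading: Disallow rules are plain URL prefixes matched against the path and query string, and the #fragment is never sent to the server, so it can never be matched:

    User-agent: *
    Disallow: /?q=search/
    # Disallows crawling /?q=search/ and anything beginning with it.
    # /?q=0#search is unaffected: the server only ever sees /?q=0,
    # which does not start with /?q=search/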
-1 votes, 1 answer

How to write disallow paths for comments when their URLs keep changing

I want to disallow pages that are created after someone comments on my post. For example, I recently got a comment (number 12) on my post page and a new page was formed with URL: https://example.com/post/#comment-12 Now if another person comments…
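A hedged note on this one: #comment-12 is a URL fragment, so https://example.com/post/#comment-12 and https://example.com/post/ are the same resource to a crawler, and no rule is needed. If the comment pages were real paths, a plain prefix rule would cover the changing number:

    User-agent: *
    # Only needed if comments lived at real paths such as /post/comment-12/
    Disallow: /post/comment-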
-2 votes, 2 answers

How to rewrite or redirect old, missing, or invalid URLs to a 404 page

Possible Duplicate: Everything You Ever Wanted to Know about Mod_Rewrite Rules but Were Afraid to Ask? I recently upgraded a site and almost all URLs have changed. I have redirected all of them (or so I hope) but it may be possible that some of…
asked by david (1)
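A hedged .htaccess sketch for the stragglers (the URL pattern is an assumption; 410 Gone tells crawlers the pages are permanently removed, which they tend to drop faster than a 404):

    RewriteEngine On
    # Hypothetical old URL scheme: answer 410 Gone
    RewriteRule ^old-section/ - [G,L]
    # Anything else that no longer resolves falls through to this page
    ErrorDocument 404 /404.html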
-3 votes, 2 answers

Restricting access from bots

I would like to protect my server from too many hits from bots. Consider a scenario where a (physical) server located on a private network is hitting my server continuously. Do I have a mechanism to identify the server behind the hits, say…
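On the robots.txt side there is only a politeness hint, and only some crawlers honor it (Googlebot, for one, ignores Crawl-delay); genuinely abusive clients need server-side rate limiting or firewall rules instead. A hedged sketch:

    # Ask well-behaved crawlers to wait 10 seconds between requests;
    # Crawl-delay is a non-standard directive and not universally honored
    User-agent: *
    Crawl-delay: 10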
-4 votes, 1 answer

How can I encourage Google to re-read a new robots.txt file?

I just updated the robots.txt file on a new site; Google Webmaster Tools reports it read my robots.txt 2 days before my last update. My last robots.txt had a "disallow: all" rule. Is there any way I can encourage Google to re-read my robots.txt as…