Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed in the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to leave alone, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about Search Engine Optimization are better suited to the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
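Real-world files are usually more selective than "Disallow: /". As a purely illustrative sketch (the directory names and sitemap URL here are made up), a robots.txt that blocks only parts of a site and advertises a Sitemap could look like:

User-agent: *
Disallow: /admin/
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap.xml

An empty "Disallow:" (or no Disallow line at all) allows everything; the Sitemap line sits outside any User-agent section and applies to all crawlers.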

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit, so don't try to use /robots.txt to hide information.
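The fetch-and-check behaviour described above is what compliant crawler libraries implement. A minimal sketch using Python's standard-library urllib.robotparser, parsing the example rules inline (the bot name "ExampleBot" is arbitrary):

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules shown above; a real crawler would instead call
# rp.set_url("http://www.example.com/robots.txt") and rp.read() to fetch
# the live file before making any other request to the host.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# "Disallow: /" blocks every path for every robot, so this prints False.
print(rp.can_fetch("ExampleBot", "http://www.example.com/welcome.html"))
```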

More information can be found at http://www.robotstxt.org/.

1426 questions
7 votes, 2 answers

Defaults for robots meta tag

If I don't specify a robots meta tag in the head of the document, the defaults are: My question is, if I only specify "noindex", is the default still "follow"? So if I specify this below, is the default…
7 votes, 2 answers

Ban robots from website

My website is often down because a spider is accessing too many resources. This is what the hosting company told me. They told me to ban these IP addresses: 46.229.164.98 46.229.164.100 46.229.164.101 But I've no idea how to do this. I've googled a bit…
testermaster (1,031)
7 votes, 1 answer

Wildcards in robots.txt

If in a WordPress website I have categories in this order: -Parent --Child ---Subchild and I have permalinks set to: %category%/%postname% Let's use an example. I create a post with the post name "Sport game". Its tag is sport-game. Its full URL is:…
user3238424 (175)
7 votes, 2 answers

How to disallow all dynamic URLs in robots.txt

how to disallow all dynamic urls in robots.txt Disallow: /?q=admin/ Disallow: /?q=aggregator/ Disallow: /?q=comment/reply/ Disallow: /?q=contact/ Disallow: /?q=logout/ Disallow: /?q=node/add/ Disallow: /?q=search/ Disallow:…
pmarreddy (281)
7 votes, 1 answer

Allow only Google CSE and disallow Google standard search in ROBOTS.txt

I have a site that I am using a Google Custom Search Engine on. I want Google CSE to crawl my site but I want it to stay out of the results of a regular Google search. I put this in my robots.txt file hoping that google CSE bots would ignore it…
Bender (361)
7 votes, 8 answers

How to ban crawler 360Spider with robots.txt or .htaccess?

I've got a problem because of 360Spider: this bot makes too many requests per second to my VPS and slows it down (CPU usage rises to 10-70%, whereas usually it is 1-2%). I looked into the httpd logs and saw lines like these: 182.118.25.209 - -…
kovpack (4,905)
7 votes, 1 answer

How to restrict the site from being indexed

I know this question was being asked many times but I want to be more specific. I have a development domain and moved the site there to a subfolder. Let's say from: http://www.example.com/ To: http://www.example.com/backup So I want the subfolder…
Ilian Andreev (1,071)
6 votes, 4 answers

robots.txt content itself is indexed?

The contents of my robots.txt file are themselves indexed and show up in Google search results. It's only Google, not Yahoo for example. I really think Google should understand not to index the contents of my robots file as it's only there…
michael (652)
6 votes, 3 answers

robots.txt: user-agent: Googlebot disallow: / Google still indexing

Look at the robots.txt of this site: fr2.dk/robots.txt The content is: User-Agent: Googlebot Disallow: / That ought to tell Google not to index the site, no? If true, why does the site appear in Google searches?
Anders (147)
6 votes, 4 answers

Googlebot not respecting Robots.txt

For some reason when I check on Google Webmaster Tool's "Analyze robots.txt" to see which urls are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file: Sitemap:…
Andrew
6 votes, 3 answers

Why am I getting a 403 for Google AdSense on my verified site?

AdSense shows that it is verified. I have waited about 10 hours and even the placeholder for ads is not appearing. AdSense does not show any Policy violations, Crawler errors, or messages. I found this while inspecting the headers for the adsense…
Dshiz (3,099)
6 votes, 2 answers

Prevent API Gateway from receiving requests for a robots.txt file

I've been working on a new project that leverages an API Gateway mapped to a lambda function. The lambda function contains a Kestrel .NET web server that receives requests via proxy through API Gateway. I have remapped API Gateway to an actual…
I. Buchan (421)
6 votes, 2 answers

robots.txt in Laravel

I was just wondering if the robots.txt file is supposed to work like general robots.txt files. So, you type for example "disallow/admin/*", place it into the root Laravel folder, and that's it. Is it like this?
rolfo85 (717)
6 votes, 2 answers

Robots.txt not working

I have used robots.txt to restrict one of the folders in my site. The folder contains the sites that are under construction. Google has indexed all those sites which are in the testing phase. So I used robots.txt. I first submitted the site and robots.txt…
user75472 (1,277)
6 votes, 1 answer

Robots.txt, disallow multilanguage URL

I have a public page that is not supposed to be possible for users to sign into. So I have a URL that there is no link to and that you have to enter manually, and then sign in. The URL is multilanguage, however, so it can be "/SV/Account/Logon" or…
Oskar Kjellin (21,280)