Questions tagged [robots.txt]

Robots.txt (the Robots Exclusion Protocol) is a text file placed at the root of a website's domain to give instructions to compliant web robots (such as search engine crawlers) about which pages to crawl and which to skip, as well as other information such as a Sitemap location. In modern frameworks it can be useful to generate the file programmatically. General questions about search engine optimization are better suited to the Webmasters Stack Exchange site.

Website owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email-address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information (see the sketch after this list).
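
Fetching a site's crawl rules takes one request, because robots.txt is served like any other page. A minimal sketch using only the standard library (the domain is a placeholder; substitute the site you want to inspect):

from urllib.request import urlopen

# robots.txt is world-readable by design; anyone can retrieve it.
print(urlopen("https://www.example.com/robots.txt").read().decode("utf-8"))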

More information can be found at http://www.robotstxt.org/.

1426 questions
18
votes
1 answer

Is this robots.txt syntax with an empty "Disallow:" correct?

Today whilst improving my web crawler to support the robots.txt standard, I came across the following code at http://www.w3schools.com/robots.txt User-agent: Mediapartners-Google Disallow: Is this syntax correct? Shouldn't it be Disallow: / or…
dangee1705
  • 3,445
  • 1
  • 21
  • 40
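
An empty Disallow: is valid under the original standard and means "disallow nothing", i.e. that agent may crawl everything. A quick check with Python's standard urllib.robotparser (the URL is illustrative):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: Mediapartners-Google",
    "Disallow:",
])

# An empty Disallow value blocks nothing, so the fetch is allowed:
print(rp.can_fetch("Mediapartners-Google", "http://www.w3schools.com/any/page.html"))  # True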
17
votes
4 answers

Is it possible to control the crawl speed by robots.txt?

We can tell bots to crawl or not to crawl our website in robots.txt. On the other hand, we can control the crawling speed in Google Webmasters (how much Googlebot crawls the website). I wonder if it is possible to limit the crawler activities by…
Googlebot
  • 15,159
  • 44
  • 133
  • 229
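
Partly. The non-standard Crawl-delay directive is honored by some crawlers (Bing and Yandex document support for it), but Googlebot ignores it; Google's crawl rate is set in its webmaster console instead. Python's urllib.robotparser exposes the value (Python 3.6+), as this sketch shows:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])

# crawl_delay() returns the delay in seconds, or None if unset:
print(rp.crawl_delay("*"))  # 10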
16
votes
2 answers

HTTP header to detect a preload request by Google Chrome

Google Chrome 17 introduced a new feature which preloads a webpage to improve rendering speed upon actually making the request (hitting enter in the omnibar). Two questions: Is there an HTTP header to detect such a request on the server side, and if one…
oxygen
  • 5,891
  • 6
  • 37
  • 69
16
votes
2 answers

How can I serve robots.txt on an SPA using React with Firebase hosting?

I have an SPA built using create-react-app and wish to have a robots.txt like this: http://example.com/robots.txt I see on this page that: You need to make sure your server is configured to catch any URL after it's configured to serve from a…
WilliamKF
  • 41,123
  • 68
  • 193
  • 295
16
votes
3 answers

How do I prevent Bing from swamping my site with traffic irregularly?

Bingbot will hit my site pretty hard for a couple of hours each day, and will be extremely light for the rest of the time. I'd either like to smooth out its crawls, reduce its rate limit, or block it altogether. It doesn't really send through any…
Tim Haines
  • 1,496
  • 3
  • 14
  • 16
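
Since Bing documents support for Crawl-delay, one option is a bingbot-specific section in robots.txt (the delay value here is only an example):

User-agent: bingbot
Crawl-delay: 10

Bing Webmaster Tools also has a crawl-control setting that can shift Bingbot's crawling toward specific hours of the day, which matches this burst pattern more directly than a flat delay.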
16
votes
1 answer

Robots.txt syntax not understood

I submitted my robots.txt file ages ago to Google and it is still giving me a syntax not understood for the first line. After Googling, the most common problem is Google adding a '?' at the start of the line, but it isn't doing that to me. The URL to…
Lex Eichner
  • 1,056
  • 3
  • 10
  • 35
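
One frequent cause of a first-line "Syntax not understood" is an invisible UTF-8 byte-order mark at the start of the file, which makes the first directive unrecognizable to the parser. A small sketch that detects and strips it (the file path is illustrative):

with open("robots.txt", "rb") as f:
    data = f.read()

if data.startswith(b"\xef\xbb\xbf"):  # UTF-8 BOM
    with open("robots.txt", "wb") as f:
        f.write(data[3:])  # rewrite the file without the BOM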
15
votes
5 answers

robots.txt in subdirectory

I have a project that lies in a folder below the main domain, and I don't have access to the root of the domain itself. http://mydomain.com/myproject/ I want to disallow indexing on the subfolder…
magnattic
  • 12,638
  • 13
  • 62
  • 115
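
Crawlers only ever request /robots.txt at the domain root, so a robots.txt inside a subfolder is ignored. Without root access, one alternative is a noindex X-Robots-Tag response header on the subfolder's pages; this WSGI middleware sketch (the prefix and names are illustrative) shows the idea:

def noindex_subfolder(app, prefix="/myproject/"):
    # Wrap a WSGI app; add X-Robots-Tag to every response under `prefix`.
    def middleware(environ, start_response):
        def start(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "").startswith(prefix):
                headers.append(("X-Robots-Tag", "noindex, nofollow"))
            return start_response(status, headers, exc_info)
        return app(environ, start)
    return middleware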
14
votes
2 answers

Web Crawler - Ignore Robots.txt file?

Some servers have a robots.txt file in order to stop web crawlers from crawling through their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python.
Craig Locke
  • 755
  • 4
  • 8
  • 12
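
In Mechanize the robots.txt check is a handler that can be switched off, though doing so is the crawler operator's responsibility. A minimal sketch (the URL is illustrative):

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # stop Mechanize from honoring robots.txt
response = br.open("http://www.example.com/")
print(response.code)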
14
votes
1 answer

Generating a dynamic /robots.txt file in a Next.js app

I need a way to respond dynamically to the /robots.txt request, and that's why I've decided to go with getServerSideProps. https://nextjs.org/docs/basic-features/data-fetching#getserversideprops-server-side-rendering If you export an async function…
cbdeveloper
  • 27,898
  • 37
  • 155
  • 336
14
votes
4 answers

Disallow or Noindex on Subdomain with robots.txt

I have dev.example.com and www.example.com hosted on different subdomains. I want crawlers to drop all records of the dev subdomain but keep them on www. I am using git to store the code for both, so ideally I'd like both sites to use the same…
Kirk Ouimet
  • 27,280
  • 43
  • 127
  • 177
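
Because each host is asked for its own /robots.txt, one codebase can serve different rules to dev. and www. by inspecting the Host header. A sketch of that idea using Flask (hostnames and rules are illustrative):

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    if request.host.startswith("dev."):
        body = "User-agent: *\nDisallow: /\n"   # block everything on dev
    else:
        body = "User-agent: *\nDisallow:\n"     # allow everything on www
    return Response(body, mimetype="text/plain")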
13
votes
5 answers

Facebook and Crawl-delay in Robots.txt?

Do Facebook's web-crawling bots respect the Crawl-delay: directive in robots.txt files?
artlung
  • 33,305
  • 16
  • 69
  • 121
13
votes
2 answers

Robots.txt, how to allow access only to domain root, and no deeper?

I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I…
WASa2
  • 131
  • 1
  • 3
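
Google's parser supports an Allow directive and a $ end-of-URL anchor, both extensions to the original standard (so not every crawler obeys them). Combined, they permit exactly the root and block everything deeper:

User-agent: *
Allow: /$
Disallow: /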
12
votes
6 answers

How to make a private URL?

I want to create a private URL such as http://domain.com/content.php?secret_token=XXXXX Then, only visitors who have the exact URL (e.g. received by email) can see the page. We check the $_GET['secret_token'] before displaying the content. My problem is…
Googlebot
  • 15,159
  • 44
  • 133
  • 229
12
votes
1 answer

Can I use the “Host” directive in robots.txt?

Searching for specific information on robots.txt, I stumbled upon a Yandex help page on this topic. It suggests that I could use the Host directive to tell crawlers my preferred mirror domain: User-Agent: * Disallow: /dir/ Host:…
dakab
  • 5,379
  • 9
  • 43
  • 67
11
votes
6 answers

Rendering plain text through PHP

For some reason, I want to serve my robots.txt via a PHP script. I have set up Apache so that the robots.txt file request (in fact, all file requests) comes to a single PHP script. The code I am using to render robots.txt is: echo "User-agent:…
JP19
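
Whatever the language, the key details when rendering robots.txt from a script are sending Content-Type: text/plain and emitting no stray whitespace before the first directive. A self-contained sketch of the same idea using only Python's standard library (the port and rules are illustrative):

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/robots.txt":
            body = b"User-agent: *\nDisallow: /admin/\n"  # generated rules
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

HTTPServer(("", 8000), Handler).serve_forever()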