12

While searching for specific information on robots.txt, I came across a Yandex help page on the topic. It suggests that I can use the Host directive to tell crawlers my preferred mirror domain:

User-Agent: *
Disallow: /dir/
Host: www.example.com

The Wikipedia article also states that Google understands the Host directive, but it gave little (in fact, no) further information.

At robotstxt.org, I didn’t find anything on Host (or on Crawl-delay, which Wikipedia also mentions).

  1. Is it encouraged to use the Host directive at all?
  2. Does Google provide any resources on this specific robots.txt directive?
  3. How compatible is it with other crawlers?

Since at least the beginning of 2021, the linked entry no longer covers the directive in question.

dakab
  • This question appears to be off-topic because it is about SEO – John Conde Feb 25 '14 at 12:40
  • 4
    It’s about a technical aspect of hostnames and robots.txt, and it’s tagged “seo” and “robots.txt”. How does it appear off-topic? – dakab Feb 25 '14 at 14:15
  • 1
    If anyone is looking for Yandex host directive spec, here's a link: https://web.archive.org/web/20190102064128/https://yandex.com/support/webmaster/controlling-robot/robots-txt.html – t1gor Aug 25 '21 at 14:21

1 Answer

14

The original robots.txt specification says:

Unrecognised headers are ignored.

They call them "headers", but this term is not defined anywhere. As it’s mentioned in the section about the format, in the same paragraph as User-agent and Disallow, it seems safe to assume that "headers" means "field names".

So yes, you can use Host or any other field name.

  • Robots.txt parsers that support such fields, well, support them.
  • Robots.txt parsers that don’t support such fields must ignore them.
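The two cases above can be sketched with a toy parser. This is only an illustration, not how any real crawler is implemented: `parse_robots` and its `supported` parameter are hypothetical names, and the parsing rules (split on the first colon, case-insensitive field names) are simplifying assumptions.

```python
def parse_robots(text, supported=("user-agent", "disallow")):
    """Split robots.txt fields into those this parser supports and those it skips.

    Hypothetical helper for illustration only; real parsers differ in detail.
    """
    recognised = []
    ignored = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue  # blank or malformed line
        name, value = line.split(":", 1)
        field = (name.strip().lower(), value.strip())
        if field[0] in supported:
            recognised.append(field)
        else:
            ignored.append(field)  # unrecognised field: skipped, not an error
    return recognised, ignored


ROBOTS = """\
User-Agent: *
Disallow: /dir/
Host: www.example.com
"""

# A parser that only knows the original fields skips Host.
basic, skipped = parse_robots(ROBOTS)

# A parser that opts in to Host sees it as a regular field.
host_aware, _ = parse_robots(ROBOTS, supported=("user-agent", "disallow", "host"))
```

Either way the file stays valid; only the set of fields acted upon changes.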

But keep in mind: as such fields are not specified by the robots.txt project, you can’t be sure that different parsers support them in the same way. So you’d have to check every supporting parser manually.
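As one example of such a manual check, here is a small sketch using Python’s standard-library `urllib.robotparser`. As far as I can tell it has no handling for `Host`, so the line should simply be skipped while the `Disallow` rule still applies. The URLs are placeholders based on the example in the question.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_LINES = [
    "User-Agent: *",
    "Disallow: /dir/",
    "Host: www.example.com",  # non-standard field, expected to be skipped
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse in-memory lines instead of fetching a URL

# The Disallow rule should still be honoured despite the extra Host line.
blocked = not parser.can_fetch("*", "http://www.example.com/dir/page.html")
allowed = parser.can_fetch("*", "http://www.example.com/index.html")

print("blocked /dir/:", blocked)
print("allowed /index.html:", allowed)
```

This only shows that one particular parser tolerates the field; it says nothing about how Yandex or any other crawler interprets it.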

unor
  • So `Host` is someone else’s addendum to the robots exclusion standard, as it’s not defined at robotstxt.org‽ – dakab Feb 26 '14 at 12:42
  • 2
    @dakab: Yes, the `Host` field is not specified in the original robots.txt specification. – unor Feb 26 '14 at 14:23