
When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly against its own User-Agent, or does it match it as a substring of its User-Agent?

Nothing I have read explicitly answers this question. According to another Stack Overflow thread, it is an exact match.

However, the RFC draft makes me believe that it is a substring match. For example, User-Agent: Google will match "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC:

The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring.

Additionally, the "Order of precedence for user-agents" section of Googlebot's documentation explains that the user agent for Google Images, "Googlebot-Image/1.0", matches User-Agent: googlebot.

I would appreciate any clarity here, and the answer may be more complicated than my question. For example, Eugene Kalinin's robots module for node mentions splitting the User-Agent to get the "name token" on line 29 and matching against that. If this is true, then Googlebot's User-Agent "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" will not match User-Agent: Googlebot.
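
To make the two interpretations concrete, here is a rough sketch of how I understand each one. The function names are hypothetical (mine, not taken from the RFC draft or the robots module):

```typescript
// Reading 1 (the RFC draft, as I read it): a case-insensitive substring
// match, so the robots.txt value "Google" matches a robot named "Googlebot".
function substringMatch(robotsTxtValue: string, robotName: string): boolean {
  return robotName.toLowerCase().includes(robotsTxtValue.toLowerCase());
}

// Reading 2 (the robots module, as I read its line 29): split the request
// User-Agent on "/" and match only against that leading "name token".
function nameToken(requestUserAgent: string): string {
  return requestUserAgent.split("/")[0];
}

const fullUA =
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";

console.log(substringMatch("Google", "Googlebot"));          // true
console.log(nameToken(fullUA));                              // "Mozilla", not "Googlebot"
console.log(substringMatch("Googlebot", nameToken(fullUA))); // false
```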


2 Answers


In the original robots.txt specification (from 1994), it says:

User-agent

[…]

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

[…]

Whether (and which) bots/parsers comply with this is another question, and can’t be answered in general.
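
For illustration, a minimal sketch of that recommendation (the helper names are my own; since the spec doesn’t say which string should contain the other, this sketch checks both directions):

```typescript
// Strip version information: "Googlebot/2.1" -> "Googlebot".
function stripVersion(token: string): string {
  return token.split("/")[0];
}

// Case-insensitive substring match of the name, without version information.
function recordApplies(recordValue: string, robotName: string): boolean {
  if (recordValue.trim() === "*") return true; // wildcard record
  const value = stripVersion(recordValue).toLowerCase();
  const name = stripVersion(robotName).toLowerCase();
  // The 1994 spec leaves the direction of the substring test open,
  // so a liberal parser might accept either containment.
  return name.includes(value) || value.includes(name);
}

console.log(recordApplies("googlebot", "Googlebot/2.1")); // true (case-insensitive)
console.log(recordApplies("Google", "Googlebot/2.1"));    // true (substring)
```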

unor
  • This was a tough decision. I really liked plasticinsect's answer, but I think this is the most "correct". It sounds like crawlers should read robots.txt User-Agent lines as a case-insensitive substring match, but each applies its own rules, as plasticinsect says. – josephdpurcell Aug 26 '13 at 17:34

Every robot does this a little differently. There is really no single reliable way to map the user-agent in robots.txt to the user-agent sent in the request headers. The safest thing to do is to treat them as two separate, arbitrary strings. The only 100% reliable way to find the robots.txt user-agent is to read the official documentation for the given robot.

Edit:

Your best bet is generally to read the official documentation for the given robot, but even this is not 100% accurate. As Michael Marr points out, Google has a robots.txt testing tool that can be used to verify which UA will work with a given robot. This tool reveals that their documentation is inaccurate. Specifically, the page https://developers.google.com/webmasters/control-crawl-index/docs/ claims that their media partner bots respond to the 'Googlebot' UA, but the tool shows that they don't.
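
To make that concrete, here is a sketch of the "two separate, arbitrary strings" approach: an explicit, hand-maintained map from each crawler's request User-Agent to the robots.txt token its documentation says it honors. The entries and names below are illustrative only, not a complete or authoritative list:

```typescript
// No parsing, no pattern matching: the two strings are related only by
// whatever the robot's own documentation says. Maintained by hand.
const robotsTokenByRequestUA: Record<string, string> = {
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)":
    "googlebot",
  "Googlebot-Image/1.0": "googlebot",
};

function robotsTokenFor(requestUserAgent: string): string | undefined {
  return robotsTokenByRequestUA[requestUserAgent];
}
```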

plasticinsect
    "The only 100% reliable way to find the robots.txt user-agent is to read the official documentation for the given robot." <-- This. RE: Google, you can go to webmaster tools and crawl > blocked URLs to test various user agents and paths to see how Google specifically treats user agent strings. I did this and saw that "User-Agent: Google" matched nothing (according to the tool), and, as you referenced, Googlebot matched their regular bot, image bot, and mobile bot - but, it did NOT match their ad and media partner bots. At least for Google's implementation, the UA is not regex, but keyword. – Michael Marr Aug 04 '13 at 04:29
  • @Michael Marr - Yes, you have a point. I tried your experiment and got the same result. The worst part is that this actually contradicts the official documentation. Apparently the docs are out-of-date. I have edited my answer. – plasticinsect Aug 05 '13 at 18:08