3

Yahoo's robots.txt contains:

User-agent: *
Disallow: /p/
Disallow: /r/
Disallow: /*?

What does the last line mean? ("Disallow: /*?")

  • I'm voting to close this question as off-topic because it belongs to web development, not to business IT management. – Daniel Sep 19 '16 at 07:32
  • @Daniel FYI At the time of posting the webmasters SE did not yet exist and we can't migrate questions older than 30 days... – HBruijn Sep 19 '16 at 11:26

2 Answers

5

If it was a Perl regular expression:

*?     Match 0 or more times, not greedily

http://perldoc.perl.org/perlre.html

However, robots.txt follows a much simpler grammar, in which `*` and `?` are not regular-expression metacharacters. As Google's documentation puts it:

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot 
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).

So in effect, Yahoo! is telling robots not to crawl any URL that contains a query string, which covers searches and most other dynamic pages.
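These wildcard rules can be approximated with a small translator. This is a sketch assuming the Google-style semantics quoted above; `to_regex` and `allowed` are illustrative helpers, not part of any robots.txt library:

```python
import re

def to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    '*' matches any run of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters (including '?') in the literal parts.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def allowed(path: str) -> bool:
    """Apply the Allow: /*?$ / Disallow: /*? pair from the example above."""
    if to_regex("/*?$").match(path):
        return True   # URL ends in '?': explicitly allowed
    if to_regex("/*?").match(path):
        return False  # URL contains '?' elsewhere: disallowed
    return True       # no query string at all: allowed

print(allowed("/page?"))         # True  -- ends with '?'
print(allowed("/page?sid=123"))  # False -- session-ID style URL
print(allowed("/page.html"))     # True  -- no '?' at all
```

Note how escaping the literal `?` is what distinguishes this from the Perl reading: here `?` only ever matches itself.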

Confusingly, this wildcard support is not part of the robots.txt RFC draft: http://www.robotstxt.org/norobots-rfc.txt

The best description is provided by Google, http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=156449

Steve-o
  • so we don't get the link from yahoo –  May 06 '10 at 06:50
  • Every Yahoo! domain has a different robots.txt, check out Yahoo! from the homepage almost every link goes to a different domain with different restrictions. – Steve-o May 06 '10 at 06:57
  • It is very important to understand that many/most crawlers do not understand wildcards in robots.txt files, because that is not part of the specification. According to http://robotstxt.org/, the only place where an asterisk is explicitly allowed is in the "User-agent" field. – Skyhawk Jul 23 '10 at 22:50
0

The * acts as a wildcard matching any sequence of characters. So any URI that contains a ? (that is, any URL with a query string) would be restricted.
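In effect (a minimal sketch; the function name is illustrative), a path is blocked by `Disallow: /*?` exactly when it contains a question mark:

```python
def blocked_by_disallow_star_q(path: str) -> bool:
    # 'Disallow: /*?' means: '/', then any characters ('*'),
    # then a literal '?' -- i.e. any path carrying a query string.
    return path.startswith("/") and "?" in path

print(blocked_by_disallow_star_q("/search?q=yahoo"))   # True
print(blocked_by_disallow_star_q("/news/index.html"))  # False
```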

Mike Chess