Here's the section for every other bot besides Google and co.

```
# Every bot that might possibly read and respect this file.
User-agent: *
Allow: /search
Disallow: /search/users
Disallow: /search/*/grid

Disallow: /*?
Disallow: /*/with_friends
Disallow: /oauth
Disallow: /1/oauth
```

Does `Disallow: /*?` disallow all URLs, in which case the rules below it are redundant, or does it disallow only URLs that contain a question mark?

More generally, I'm interested in knowing whether I'm allowed to visit a person's profile page and automatically follow the link to their personal website. No scraping in the middle, just following the link.

Thanks,

Raz
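For concreteness, the prefix-plus-wildcard interpretation can be sketched in Python. This is my own illustration, not part of the site's robots.txt handling: `rule_to_regex` and `disallowed` are made-up helper names, `Allow` lines and `$` end-anchors are ignored, and `*` is expanded per the de-facto (Google-documented) extension.

```python
import re

def rule_to_regex(rule):
    # '*' matches any run of characters; everything else is literal.
    # Disallow rules are prefix matches, so no end anchor is added.
    return re.compile("".join(".*" if c == "*" else re.escape(c) for c in rule))

def disallowed(path, rules):
    # A path is blocked if any rule matches at the start of it.
    return any(rule_to_regex(r).match(path) for r in rules)

rules = ["/*?", "/*/with_friends", "/oauth"]

print(disallowed("/bla?x=2", rules))      # True: a '?' mid-URL still matches the /*? prefix
print(disallowed("/bla", rules))          # False: no '?' anywhere in the path
print(disallowed("/oauth/token", rules))  # True: plain prefix match on /oauth
```

Under this reading, `Disallow: /*?` blocks every URL whose path-plus-query contains a `?`, not just those ending in one.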


1 Answer

The robots.txt spec only allows `*` as a wildcard, so `/*?` disallows all URLs that end in an empty query string. Because `?` is not a wildcard, `/*?` does not disallow ALL URLs, just those that end in a `?`.

Marc B
  • I see, so `?` is not the "optional" operator from regular expressions. Shouldn't it disallow all URLs that start with whatever is in the rule, in which case non-empty query strings would also be disallowed? – Raz Oct 06 '12 at 15:46
  • Agreed, the question was whether `http://example.com/bla?x=2` is disallowed or not. Anyway, the original question was answered; I was mainly interested in whether that rule disallows all URLs or not. – Raz Oct 06 '12 at 15:49
  • `/*?` in regex form would be `#^/.*\?$#`. – Marc B Oct 06 '12 at 15:52
  • I think it is `#^/.*\?.*$#`. For example `/search` above will allow all pages that start with /search. – Raz Oct 06 '12 at 16:01
  • 1
    @MarcB: "so `/*?` disallows all urls that end in an empty query string" → this is not correct. It disallows all URLs that start with "anything" (`*`), followed by a question mark (`?`), **followed by anything**. In robots.txt `Disallow` you always set URL prefixes. Note that the wildcard `*` is not specified in the original specification. – unor Oct 07 '12 at 01:33