
For some reason, when I use Google Webmaster Tools' "Analyze robots.txt" to see which URLs are blocked by our robots.txt file, it's not what I'm expecting. Here is a snippet from the beginning of our file:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that the two robots are picking up the correct directives, because Googlebot reports the scripts are blocked by line 7 while Mediapartners-Google is blocked by line 4. And yet ANY other URL I enter from the Disallow list under the second user-agent directive is NOT blocked!

I'm wondering if my comment or my use of absolute URLs is screwing things up...

Any insight is appreciated. Thanks.

4 Answers

11

The reason they are ignored is that you have fully qualified URLs in the Disallow entries of your robots.txt file, while the specification doesn't allow that. (You should only specify root-relative paths, i.e. paths starting with /.) Try the following:

Sitemap: /sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

As for caching, Google tries to fetch a fresh copy of the robots.txt file roughly every 24 hours on average.
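
If you want to sanity-check the corrected rules before Google re-reads the file, Python's standard urllib.robotparser applies the same kind of path-prefix matching. This is just a rough local sketch (example.com below stands in for the omitted domain, and the file is trimmed down to the generic group); Google's own "Analyze robots.txt" tool remains the authoritative test:

import urllib.robotparser

# Corrected rules with root-relative Disallow paths.
# example.com is a placeholder for the real (omitted) domain.
robots_txt = """\
User-agent: *
Disallow: /scripts
Disallow: /Living/books/book-review-not-stupid.aspx
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Both should print False, i.e. the URLs are blocked for the generic crawler.
# The path under /scripts is just an illustrative example.
print(rp.can_fetch("*", "http://example.com/scripts/whatever.js"))
print(rp.can_fetch("*", "http://example.com/Living/books/book-review-not-stupid.aspx"))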

Andrew Moore
  • Is that first line correct? http://www.sitemaps.org/protocol.php#submit_robots indicates that the sitemap location should be the complete URL. – David Citron Mar 28 '09 at 15:21
  • Site map with full URL is ok, but your disallow lists should still be absolute. – Andrew Moore Mar 28 '09 at 19:52
  • Following David Z below, wouldn't this formulation be a bit clearer?: Site map with full URL is ok, but disallow lists should be relative URLs based on the document root. – tuk0z Jan 23 '15 at 11:11
2

It's the absolute URLs. robots.txt is only supposed to contain relative URIs; the domain is inferred from the host the robots.txt file was fetched from.
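
To make that concrete: the value after Disallow is just the path component of the full URL, and the host part is supplied by whichever site served the robots.txt. A quick way to pull the path out, with example.com standing in for the omitted domain:

from urllib.parse import urlparse

full_url = "http://example.com/Living/books/book-review-not-stupid.aspx"

# robots.txt only ever sees the path component; the host is implied by
# where the robots.txt file itself lives.
print(urlparse(full_url).path)  # /Living/books/book-review-not-stupid.aspx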

David Z
0

It's been up for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it's recent.

  • You're probably better off editing the original question (typically by putting EDIT in bold at the bottom followed by the extra information) rather than answering your own question (I realize you can't comment yet). – cletus Jan 20 '09 at 23:48
-1

Did you recently make this change to your robots.txt file? In my experience, it seems that Google caches that file for a really long time.

Webjedi