
I'm trying to prevent Google, Yahoo et al from hitting my /products/ID/purchase page and am unsure on how to do it.

I currently block them from hitting sign in with the following:

User-agent: *
Disallow: /sign_in

Can I do something like the following?

User-agent: *
Disallow: /products/*/purchase

Or should it be:

User-agent: *
Disallow: /purchase
Gerard

2 Answers


I assume you want to block /products/ID/purchase but allow /products/ID.

Your last suggestion would only block URLs whose paths start with /purchase, so it would not match /products/ID/purchase:

User-agent: *
Disallow: /purchase

So this is not what you want.

You'd need your second suggestion:

User-agent: *
Disallow: /products/*/purchase

This would block all URLs that start with /products/, followed by any character(s), followed by /purchase.

Note: It uses the wildcard *. In the original robots.txt "specification", this is not a character with special meaning. However, some search engines extended the spec and use it as a kind of wildcard. So it should work for Google and probably some other search engines, but you can't count on every other crawler/bot supporting it.
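To see that caveat concretely, here's a minimal sketch using Python's standard-library urllib.robotparser, which implements the original spec and therefore treats * as a literal character (example.com and the bot name are just placeholders):

```python
# Sketch: a spec-only robots.txt parser does not honor the * wildcard.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /sign_in",
    "Disallow: /products/*/purchase",
])

# The plain prefix rule is honored:
print(rp.can_fetch("mybot", "https://example.com/sign_in"))
# The wildcard rule is matched literally, so this URL is still allowed
# by a parser that only implements the original spec:
print(rp.can_fetch("mybot", "https://example.com/products/123/purchase"))
```

A wildcard-aware crawler like Googlebot would block the second URL, but a strictly spec-compliant one would not.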

So your robots.txt could look like:

User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase

Also note that some search engines (including Google) might still list a URL in their search results (without title/snippet) although it is blocked in robots.txt. This might be the case when they find a link to a blocked page on a page that is allowed to be crawled. To prevent this, you'd have to noindex the document.
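Assuming the purchase pages are HTML, a noindex could look like this (alternatively, send an X-Robots-Tag: noindex HTTP response header):

```html
<!-- In the <head> of /products/ID/purchase -->
<meta name="robots" content="noindex">
```

Keep in mind that a crawler can only see this tag if it's allowed to fetch the page, so noindex and a robots.txt block of the same URL don't combine well.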

unor

According to Google, Disallow: /products/*/purchase should work. But according to robotstxt.org (the original specification, which has no wildcard support), it doesn't.

simonmenke