
I'm trying to prevent Google, Yahoo et al from hitting my /products/ID/purchase page and am unsure on how to do it.

I currently block them from hitting sign in with the following:

User-agent: *
Disallow: /sign_in

Can I do something like the following?

User-agent: *
Disallow: /products/*/purchase

Or should it be:

User-agent: *
Disallow: /purchase
Gerard

2 Answers


I assume you want to block /products/ID/purchase but allow /products/ID.

Your last suggestion would only block URLs whose paths start with /purchase, so it would not match /products/ID/purchase:

User-agent: *
Disallow: /purchase

So this is not what you want.

You'd need your second suggestion:

User-agent: *
Disallow: /products/*/purchase

This would block all URLs that start with /products/, followed by any character(s), followed by /purchase.

Note: It uses the wildcard *. In the original robots.txt "specification", this is not a character with special meaning. However, some search engines extended the spec and use it as a kind of wildcard. So it should work for Google and probably some other search engines, but you can't count on every other crawler/bot supporting it.
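To see that caveat concretely, here's a minimal sketch using Python's standard-library urllib.robotparser, which implements the original spec and therefore treats * as a literal character (example.com and the bot name are just placeholders):

```python
# Sketch: a spec-only robots.txt parser does not honor the * wildcard.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /sign_in",
    "Disallow: /products/*/purchase",
])

# The plain prefix rule is honored:
print(rp.can_fetch("mybot", "https://example.com/sign_in"))
# The wildcard rule is matched literally, so this URL is still allowed
# by a parser that only implements the original spec:
print(rp.can_fetch("mybot", "https://example.com/products/123/purchase"))
```

A wildcard-aware crawler like Googlebot would block the second URL, but a strictly spec-compliant one would not.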

So your robots.txt could look like:

User-agent: *
Disallow: /sign_in
Disallow: /products/*/purchase

Also note that some search engines (including Google) might still list a URL in their search results (without title/snippet) although it is blocked in robots.txt. This might be the case when they find a link to a blocked page on a page that is allowed to be crawled. To prevent this, you'd have to noindex the document.
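Assuming the purchase pages are HTML, a noindex could look like this (alternatively, send an X-Robots-Tag: noindex HTTP response header):

```html
<!-- In the <head> of /products/ID/purchase -->
<meta name="robots" content="noindex">
```

Keep in mind that a crawler can only see this tag if it's allowed to fetch the page, so noindex and a robots.txt block of the same URL don't combine well.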

unor

According to Google, Disallow: /products/*/purchase should work. But according to robotstxt.org (the original specification, which has no wildcard support), it doesn't.

simonmenke