I have a small Magento site which consists of page URLs such as:

http://www.example.com/contact-us.html
http://www.example.com/customer/account/login/

However I also have pages which include filters (e.g. price and colour) and two such examples are:

http://www.example.com/products.html?price=1%2C1000
http://www.example.com/products/chairs.html?price=1%2C1000

The issue is that when Googlebot and the other search engine bots crawl the site, crawling essentially grinds to a halt because they get stuck in all the "filter links".

So, how can the robots.txt file be configured, e.g.:

User-agent: *
Allow:
Disallow: 

To allow all pages like:

http://www.example.com/contact-us.html
http://www.example.com/customer/account/login/

to get indexed, but in the case of http://www.example.com/products/chairs.html?price=1%2C1000 index chairs.html and ignore all the content after the ?. The same should apply to http://www.example.com/products.html?price=1%2C1000.

I also don't want to have to specify each page in turn; I just want a rule that ignores everything after the ? but not the main page itself.

  • The issue is that my URLs don't just have a query string for "price", nor do they have a forward slash before the point where I want matching to stop, as shown in other robots.txt examples. So, as another URL example: http://www.mysite.com/products/chairs.html?manufacturer=128&usage=165. I want Google to still index http://www.mysite.com/products/ and http://www.mysite.com/products/chairs.html, but stop it accessing the page that includes the filter "manufacturer=128&usage=165", i.e. always stop at the ?. – Christine M. Reaves Sep 16 '11 at 23:17
  • See the additional info in my response. – Jim Mischel Sep 16 '11 at 23:51

4 Answers

I think this will do it:

User-Agent: *
Disallow: /*?

That will disallow any url that contains a question mark.

If you want to disallow just those that have ?price, you would write:

Disallow: /*?price

See related questions such as:

Restrict robot access for (specific) query string (parameter) values?

How to disallow search pages from robots.txt

Additional explanation:

The syntax Disallow: /*? says, "disallow any url that has a question mark in it." The / is the start of the path-and-query part of the url. So if your url is http://mysite.com/products/chairs.html?manufacturer=128&usage=165, the path-and-query part is /products/chairs.html?manufacturer=128&usage=165. The * says "match any character". So Disallow: /*? will match /<anything>?<more stuff> -- anything that has a question mark in it.
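
If you want to sanity-check that behaviour, here is a minimal Python sketch of the matching, assuming Google-style semantics where * matches any run of characters and the rule is anchored at the start of the path-and-query (an illustration only, not Googlebot's actual implementation):

import re
from urllib.parse import urlsplit

def is_disallowed(url, rule):
    # Build the path-and-query part, e.g. "/products/chairs.html?price=1%2C1000".
    parts = urlsplit(url)
    path_and_query = parts.path + ("?" + parts.query if parts.query else "")
    # Escape the rule, then turn the escaped "*" back into "match anything".
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path_and_query) is not None

for url in [
    "http://www.example.com/contact-us.html",
    "http://www.example.com/products/chairs.html",
    "http://www.example.com/products/chairs.html?price=1%2C1000",
]:
    print(is_disallowed(url, "/*?"), url)

This prints False, False, True: only the URL with a query string is disallowed, which is exactly what the question asks for.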

– Jim Mischel

Jim Mischel is correct. Using the wildcards he mentioned, you can block particular query strings from being crawled, bearing in mind that only the major search engines support wildcards in robots.txt.

You can then test your rules before applying them using the Google Webmaster Tools robots.txt testing tool: https://www.google.com/webmasters/tools/robots-testing-tool.
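
If you want to confirm from a script that the deployed file is the one you are testing, a trivial fetch works; a small sketch using Python's standard library and the question's hypothetical domain:

import urllib.request

# Fetch the live robots.txt so you can confirm that the rules you are
# testing in the tool are the ones the site actually serves.
with urllib.request.urlopen("http://www.example.com/robots.txt") as resp:
    print(resp.read().decode("utf-8"))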

– Liam McArthur

I'll help.

On my site each post is reachable at two URLs, one without .html and one with .html. If you want to block the .html duplicates, you can do that in robots.txt (note that a rule must start with / or *, so Disallow: .Html would be ignored):

Disallow: /*.html
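
Note, though, that a wildcard rule like this would also block pages the question wants indexed, such as contact-us.html. A quick illustrative check, using a regex as an assumed model of Google-style matching (not Googlebot's actual code):

import re

# Google-style "/*.html" behaves roughly like the regex "/.*\.html"
# anchored at the start of the path-and-query.
print(bool(re.match(r"/.*\.html", "/contact-us.html")))  # True: blocked as well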

– giasi

You should be able to do:

Disallow: /?price=*

or even:

Disallow: /?*

– Lee H
  • I don't think that's going to work. It will block all urls that *start with* `/?` or `/?price=`. In addition, the trailing asterisk is not required. It's implied by the specification. – Jim Mischel Sep 16 '11 at 22:32
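
To see the difference concretely: under the original robots.txt rules a Disallow value is matched as a plain prefix of the path-and-query, so a quick check (illustrative only):

rule = "/?price="
print("/?price=1%2C1000".startswith(rule))                      # True: would be blocked
print("/products/chairs.html?price=1%2C1000".startswith(rule))  # False: not blocked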