
My website allows search engines to index the same page in two formats, like:

  • www.example.com/page-1271.html
  • www.example.com/page-1271-page-title.html

All my site pages are like that. How can I block the first format in the robots.txt file? I mean, is there a rule like:

Disallow: /page-(numbers).html
– hatem tawfik

2 Answers


The original robots.txt specification does not define any wildcards. (However, some parsers, such as Google's, have added wildcard support anyway.)

If your concern is that search engines should index only one of your two variants, there are alternatives to robots.txt:

You could redirect (with 301) from example.com/page-1271.html to example.com/page-1271-page-title.html. This would be the best solution, as everyone (users and bots) then works with the same URL.
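Note that the title segment cannot be derived from the short URL alone, so the server has to look it up before issuing the redirect. A minimal sketch of that decision logic, in Python for illustration (the lookup table TITLES is a hypothetical stand-in for whatever database stores the page titles):

```python
import re

# Hypothetical id -> title-slug lookup; in a real site this would
# come from the database that stores page titles.
TITLES = {1271: "page-title"}

# The short URL form: "/page-" followed by digits only.
SHORT_FORM = re.compile(r"^/page-(\d+)\.html$")

def redirect_target(path):
    # Return the long-form URL to 301-redirect to, or None if the
    # path is already in the long form (or the id is unknown).
    m = SHORT_FORM.match(path)
    if not m:
        return None
    page_id = int(m.group(1))
    title = TITLES.get(page_id)
    if title is None:
        return None
    return "/page-%d-%s.html" % (page_id, title)
```

For example, redirect_target("/page-1271.html") yields "/page-1271-page-title.html", while the long form passes through untouched.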

Or you could use the canonical link relation. On example.com/page-1271.html (or on both variants) you could add a link element to the head:

<link href="http://example.com/page-1271-page-title.html" rel="canonical" />

This tells search engine bots etc. to use the canonical URL instead of the current URL.

– unor

There is no such regexp option in robots.txt. You have a couple of options:

1) Place the robots disallow information into the head element of the HTML files (a "robots" meta tag with noindex).
2) Write a script that adds every blockable HTML file as a separate line to robots.txt.
3) Place content pages in a separate directory and disallow access to that directory.
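Option 2 can be sketched in a few lines. This is a minimal Python illustration, assuming the filenames are already available as a list; it emits one Disallow line per short-form page, for pasting into robots.txt under a User-agent group:

```python
import re

# The short URL form to block: "page-" followed by digits only.
SHORT_FORM = re.compile(r"^page-\d+\.html$")

def disallow_lines(filenames):
    # Return a "Disallow:" line for every filename in the short
    # form, leaving the long "-page-title" variants untouched.
    return ["Disallow: /" + name for name in filenames
            if SHORT_FORM.match(name)]
```

Given ["page-1271.html", "page-1271-page-title.html"], this returns only ["Disallow: /page-1271.html"].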

Some search engines (such as Google), but not all of them, respect pattern matching: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449&from=35237&rd=1

User-agent: *
Disallow: /page-*.html
Allow: /page-*-page-title.html

Here the Allow overrides the Disallow; this, too, is not supported by all search engines. The easiest approach would be to restructure your files (or add URL rewrites), or else to place the robots information into the HTML files themselves.
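To see why the Allow wins here: in Google's interpretation, the most specific (longest) matching rule takes precedence, and on a tie Allow beats Disallow. A simplified Python sketch of that precedence rule (it handles only the "*" wildcard, not "$", and is an illustration of Google's behavior, not a general robots.txt parser):

```python
import re

def _to_regex(pattern):
    # Convert a robots.txt path pattern into a regex anchored at
    # the start of the path; "*" matches any character sequence.
    parts = []
    for ch in pattern:
        parts.append(".*" if ch == "*" else re.escape(ch))
    return re.compile("^" + "".join(parts))

def is_allowed(path, allows, disallows):
    # Google-style precedence: the longest matching rule wins;
    # on a tie, Allow beats Disallow. No matching rule = allowed.
    best_allow = max((len(p) for p in allows
                      if _to_regex(p).match(path)), default=-1)
    best_disallow = max((len(p) for p in disallows
                         if _to_regex(p).match(path)), default=-1)
    return best_allow >= best_disallow
```

With the rules above, /page-1271-page-title.html matches both patterns but the Allow pattern is longer, so it stays crawlable, while /page-1271.html matches only the Disallow and is blocked.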

– Vilsepi
  • Thank you. Actually they are not HTML files, they are PHP, and every single page can be read in both of the above formats, so I can't place disallow information into the head because that would disallow both URLs; for the same reason I can't move the unwanted URLs into a separate directory. Thank you for your help – hatem tawfik Jun 10 '13 at 22:21