
I have URLs like example.com/post/alai-fm-sri-lanka-listen-online-1467/

I want to remove all URLs which have post word in them using robots.txt

So which is the correct format?

Disallow: /post-*

Disallow: /?page=post

Disallow: /*page=post
razaulmustafa

2 Answers


(Note that the file has to be called robots.txt; I corrected it in your question.)

You only included one example URL, where "post" is the first path segment. If all your URLs look like that, the following robots.txt should work:

User-agent: *
Disallow: /post/

It would block the following URLs:

  • http://example.com/post/
  • http://example.com/post/foobar
  • http://example.com/post/foo/bar

The following URLs would still be allowed:

  • http://example.com/post
  • http://example.com/foo/post/
  • http://example.com/foo/bar/post
  • http://example.com/foo?page=post
  • http://example.com/foo?post=1
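One way to sanity-check which URLs this rule blocks is Python's standard-library robots.txt parser. This is just a sketch: `urllib.robotparser` implements plain prefix matching only, which is all the `Disallow: /post/` rule needs.

```python
from urllib import robotparser

# Parse the proposed robots.txt rules in memory.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /post/",
])

# URLs under /post/ are blocked...
print(rp.can_fetch("*", "http://example.com/post/foobar"))   # False
# ...while /post without the trailing slash, and /post in a
# later path segment, are still allowed.
print(rp.can_fetch("*", "http://example.com/post"))          # True
print(rp.can_fetch("*", "http://example.com/foo/post/"))     # True
```

Note that this parser checks the path as a prefix, which matches how the original (non-wildcard) robots.txt rules work in all major crawlers.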
unor
  • Thank you, unor. Almost all of my URLs look like the example above. But one more point of confusion: all pages are indexed in Google. What should I do so that all old URLs starting with the word "post" are automatically removed from Google? I have put this in my robots.txt file. Please check it and let me know: is it fine for removing URLs that are already indexed in Google, or do I have to do something else? http://www.zustream.com/robots.txt – razaulmustafa Nov 20 '13 at 11:48
  • @razaulmustafa: Your current robots.txt doesn’t block URLs starting with `/post/`. Also, robots.txt doesn’t necessarily remove URLs from search engine results. It only forbids that search engines crawl the content of the blocked URLs (=pages). If you want to remove your URLs from search engine indexes, that’s another question (probably for [webmasters.se]). – unor Nov 20 '13 at 12:06
  • Actually, I removed the whole site from Google Search using Google Webmaster Tools. Now I want to reindex my complete site with new URLs. What changes should I make so that Google indexes only the new URLs, not the old ones? Thanks. – razaulmustafa Nov 20 '13 at 12:35
  • @razaulmustafa: That seems to be a different question (you could use 301 redirects or `rel`-`canonical` on your old pages/URLs; etc.). – unor Nov 20 '13 at 13:15

Googlebot and Bingbot both handle limited wildcarding, so this will work:

Disallow: /*post

Of course, that will also disallow any URL that contains the words "compost", "outpost", "poster", or anything else that contains the substring "post".

You could try to make it a little better. For example:

Disallow: /*/post    # any later path segment that starts with "post"
Disallow: /*?post=   # the post query parameter
Disallow: /*=post    # any query value that starts with "post"

Understand, though, that not all bots support wildcards, and of those that do, some are buggy. Bing and Google handle them correctly; there's no guarantee that other bots will.
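Since only some crawlers support these wildcards, it can help to see concretely what a Google-style pattern matches. Below is a rough sketch that translates a pattern into a regular expression (the real crawler matching is more involved; this only handles `*` and a trailing `$` anchor, and `robots_pattern_to_regex` is a hypothetical helper name):

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern into a regex.
    '*' matches any run of characters; a trailing '$' anchors the end.
    This is an approximation for illustration, not a full implementation."""
    escaped = re.escape(pattern).replace(r"\*", ".*")
    if escaped.endswith(r"\$"):
        escaped = escaped[:-2] + "$"
    return re.compile("^" + escaped)

rule = robots_pattern_to_regex("/*post")
print(bool(rule.match("/post/alai-fm-sri-lanka-listen-online-1467/")))  # True
print(bool(rule.match("/compost")))  # True: the over-matching problem above
print(bool(rule.match("/foo/bar")))  # False
```

Running the over-broad `/*post` pattern against a few paths makes the "compost" problem visible immediately, which is a quick way to vet a rule before deploying it.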

Jim Mischel