
Can anyone help me add a Disallow rule to my robots.txt file that will stop crawlers from indexing any link containing %2C, which is the URL (percent) encoding for a comma (,)?

I think what I'm looking for is a wildcard character, if one exists in robots.txt.

So far I have this:

Disallow: %2C

But I cannot seem to get it working.

Any suggestions? Cheers

Gga
  • try `/*,` or `/*%2C` also see http://stackoverflow.com/questions/6859399/block-google-robots-for-urls-containing-a-certain-word – Prasanth Sep 06 '12 at 10:32
  • @goldenparrot I was thinking of Disallow: /*%2C, with the * allowing any characters before it? – Gga Sep 06 '12 at 10:33
  • @RodgersandHammertime it's not a regular expression or a wildcard (read on)! Take a look at http://www.robotstxt.org/orig.html for what's allowed and more or less *standard* in a robots.txt file. In short: Disallow accepts full or partial rooted URLs. – Adriano Repetti Sep 06 '12 at 10:37

1 Answer


The best way to test robots.txt against the search engines is to use the tools they provide. Google Webmaster Tools has a robots.txt tester under "Health > Blocked URLs". If you use

User-agent: *
Disallow: *,*

this will block any request for http://example.com/url%2Cpath/. I tried Disallow: *%2C*, but apparently that doesn't stop Googlebot from crawling the URL-encoded path. My guess is that Googlebot normalizes the encoding during its queuing process.
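To make that guess concrete, here is a minimal Python sketch of wildcard rule matching. It is not Googlebot's actual algorithm: the helper names `rule_to_regex` and `is_blocked` are made up for illustration, and the `decode` flag models the assumption that the crawler percent-decodes the path before comparing it with the rule.

```python
import re
from urllib.parse import unquote


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow pattern into a regex.

    '*' matches any run of characters and a trailing '$' anchors the end;
    everything else is treated literally. (Hypothetical helper.)
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule: str, path: str, decode: bool = True) -> bool:
    """Return True if `path` matches the Disallow `rule` from its start.

    decode=True is an ASSUMPTION: it models a crawler that turns %2C
    back into ',' before matching, as guessed in the answer above.
    """
    candidate = unquote(path) if decode else path
    return rule_to_regex(rule).match(candidate) is not None


if __name__ == "__main__":
    for rule in ("*,*", "*%2C*"):
        for decode in (True, False):
            result = is_blocked(rule, "/url%2Cpath/", decode=decode)
            print(f"rule={rule!r} decode={decode} blocked={result}")
```

Under the decoding assumption, `*,*` matches `/url%2Cpath/` and `*%2C*` does not, which is consistent with what the tester reported. A crawler that compares the raw path would behave the other way round, so the vendor tools remain the real check.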

As for Bing, they have apparently removed their robots.txt validation tool. So the only sure way of testing it is to deploy a robots.txt on a test site, and then use Bing Webmaster Tools to fetch a page with the ',' in its URL. It will tell you at that point whether the page is blocked by robots.txt or not.

Remember that robots.txt doesn't prevent the search engines from displaying a URL in the search results. It only prevents them from crawling the URL. If you simply don't want those types of URLs in the search results, but don't mind the pages being crawled (meaning you must not block those URLs with robots.txt, or the crawler will never see the directive), you can add a robots meta tag, or an X-Robots-Tag in the HTTP headers, with a value of NOINDEX to keep them out of the search results.
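As a rough illustration of the header approach, the following Python sketch wraps a WSGI application so that responses for URLs containing a comma carry `X-Robots-Tag: noindex`. The names `noindex_commas` and `demo_app` are hypothetical, and the sketch assumes your site is served through WSGI; on other stacks the same header would be set in the web server or framework configuration instead.

```python
from urllib.parse import unquote


def noindex_commas(app):
    """Wrap a WSGI app; add X-Robots-Tag: noindex when the URL has a comma."""

    def wrapped(environ, start_response):
        # Decode defensively so both ',' and '%2C' are detected,
        # whether or not the server has already decoded the path.
        path = unquote(environ.get("PATH_INFO", ""))
        query = unquote(environ.get("QUERY_STRING", ""))
        has_comma = "," in path or "," in query

        def patched_start(status, headers, exc_info=None):
            if has_comma:
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serves on localhost:8000; try /url,path/ and /plain/ and compare headers.
    make_server("localhost", 8000, noindex_commas(demo_app)).serve_forever()
```

For this to have any effect, the comma URLs must remain crawlable, so they should not also be disallowed in robots.txt.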

Regarding one of the other comments about using the "nofollow" standard: nofollow doesn't actually prevent the search engines from crawling those URLs. It is better understood as a way of disavowing any endorsement of the link's destination. Google and Bing have suggested using nofollow to mark sponsored links or untrusted UGC links.

eywu