
I want to allow crawlers to access my domain's root directory (i.e. the index.html file), but nothing deeper (i.e. no subdirectories). I do not want to have to list and deny every subdirectory individually within the robots.txt file. Currently I have the following, but I think it is blocking everything, including stuff in the domain's root.

User-agent: *
Allow: /$
Disallow: /

How can I write my robots.txt to accomplish this?

Thanks in advance!

WASa2
    This can't be done in a "robot-universal" way. Do you have access to a .htaccess or similar? – alexn Mar 05 '11 at 20:35
  • I do have access to .htaccess. Basically, my goal, using robots.txt, meta tags, and meta http headers, is to do all I personally can to prevent anything but my main page (i.e. index.html) from ending up in a search engine results. – WASa2 Mar 05 '11 at 20:40

2 Answers


There's nothing that will work for all crawlers. There are two options that might be useful to you.

Robots that allow wildcards should support something like:

Disallow: /*/

The major search engine crawlers understand the wildcards, but unfortunately most of the smaller ones don't.

If you have relatively few files in the root and you don't often add new files, you could use Allow to allow access to just those files, and then use Disallow: / to restrict everything else. That is:

User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /

The order here is important. Crawlers are supposed to take the first match. So if your first rule was Disallow: /, a properly behaving crawler wouldn't get to the following Allow lines.
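That first-match behavior can be checked with Python's `urllib.robotparser`, which evaluates rules in file order the same way. A minimal sketch (example.com is just a placeholder domain):

```python
from urllib import robotparser

# The rules from the answer: allow specific root files, block everything else.
rules = """\
User-agent: *
Allow: /index.html
Allow: /coolstuff.jpg
Allow: /morecoolstuff.html
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The first matching rule wins, so the Allow lines take effect
# before the blanket Disallow.
print(rp.can_fetch("*", "http://example.com/index.html"))    # True
print(rp.can_fetch("*", "http://example.com/sub/page.html")) # False
```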

If a crawler doesn't support Allow, it will see the Disallow: / and not crawl anything on your site. Provided, of course, that it ignores things in robots.txt that it doesn't understand.

All the major search engine crawlers support Allow, and a lot of the smaller ones do, too. It's easy to implement.

Jim Mischel

In short, no: there is no way to do this nicely using the robots.txt standard. Remember that Disallow specifies a path prefix; wildcards and Allow are non-standard.

So the following approach (a kludge!) will work.

User-agent: *
Disallow: /a
Disallow: /b
Disallow: /c
...
Disallow: /z
Disallow: /A
Disallow: /B
Disallow: /C
...
Disallow: /Z
Disallow: /0
Disallow: /1
Disallow: /2
...
Disallow: /9
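Rather than typing all 62 lines by hand, the list above (a–z, A–Z, 0–9) can be generated with a short Python sketch. Note the kludge still misses paths starting with other legal URL characters, such as `_` or `-`:

```python
import string

# Emit the "deny every first character" kludge programmatically.
lines = ["User-agent: *"]
for ch in string.ascii_letters + string.digits:
    lines.append(f"Disallow: /{ch}")

print("\n".join(lines))
```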
Ben George