
It sounds like a simple question: exclude the Wayback Machine crawler (ia_archiver) and allow all other user agents.

So I set up the robots.txt as follows:

```
User-agent: *

Sitemap: https://www.example.com/sitemap.xml


User-agent: ia_archiver
Disallow: /
```

After half a year I noticed that the visitor count of my site had dropped tremendously.

After a while I realized that Googlebot had stopped indexing my site.

This was confirmed by Google's robots.txt checker.

The `Disallow: /` rule was picked up by Googlebot too; not only ia_archiver was blocked.

The obvious question is:

What is wrong with this robots.txt?

Is the order of the entries the culprit?

  • See: https://www.robotstxt.org/robotstxt.html – Ouroborus May 07 '23 at 05:58
  • This does not solve the issue. – Avatar May 07 '23 at 05:59
  • [Google](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt?hl=en) seems to say you did *not* properly set your file to allow their bot though. – OldPadawan May 07 '23 at 06:02
  • 1
    Sure it does. In its examples, it shows that you need at least one `Disallow` for each `User-agent`. It also tells you how to set it up so that google is allowed but other bots are not allowed. – Ouroborus May 07 '23 at 06:02
  • In your link there is an entry "To allow a single robot", but not the case I described in the question. Please read the question again. – Avatar May 07 '23 at 06:03
  • 1
    It's exactly the case you describe. – Ouroborus May 07 '23 at 06:04
  • I want to allow all bots, but not the Wayback Machine. – Avatar May 07 '23 at 06:05
  • 1
    So add an empty `Disallow` to your `User-agent: *`. Again, you need at least one `Disallow` for each `User-agent`. – Ouroborus May 07 '23 at 06:05
  • Having `User-agent: * Disallow:` followed by `User-agent: ia_archiver Disallow: /` – does this still block the ia_archiver? – Avatar May 07 '23 at 06:08
  • That's what the docs say should happen. – Ouroborus May 07 '23 at 06:11
  • 1
    Oh, you also probably need the `*` entry to be the last `User-agent`. It's not part of the docs, but implementation has been left up to developers so, often, the first match is the one that's obeyed. – Ouroborus May 07 '23 at 06:13

1 Answer


The solution:

```
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: 

Sitemap: https://www.example.com/sitemap.xml
```

The `ia_archiver` group must come first.

The empty `Disallow:` allows all other user agents to crawl the site.
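You can sanity-check the corrected file locally with Python's standard-library `urllib.robotparser` before deploying it (a minimal sketch; the URL is a placeholder):

```python
from urllib import robotparser

# The corrected robots.txt from this answer.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# ia_archiver is blocked; every other crawler is allowed.
print(rp.can_fetch("ia_archiver", "https://www.example.com/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/page.html"))    # True
```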

  • 1
    The order of the `User-agent` lines doesn't matter if the crawler is implemented according to the spec. 3.2.1 of the robots.txt standard says that user agents have to first look for `User-agent` lines that match their name, and then fall back to `*`. – Stephen Ostermiller May 07 '23 at 10:43
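As a quick check of that claim, Python's `urllib.robotparser` follows the spec here: it tries groups matching the crawler's name before falling back to `*`, so ia_archiver stays blocked even with the wildcard group first (a sketch with a placeholder URL):

```python
from urllib import robotparser

# Same rules as the answer, but with the wildcard group first.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: ia_archiver
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A spec-compliant parser consults the group matching the crawler's
# name before the * fallback, so group order does not matter.
print(rp.can_fetch("ia_archiver", "https://www.example.com/"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/"))    # True
```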