
It sounds like a simple question: exclude the Wayback Machine crawler (ia_archiver) and allow all other user agents.

So I set up the robots.txt as follows:

```
User-agent: *

Sitemap: https://www.example.com/sitemap.xml


User-agent: ia_archiver
Disallow: /
```

After half a year I noticed that the visitor count of my site had dropped tremendously.

After a while I realized that Googlebot had stopped indexing my site.

This was confirmed by Google's robots.txt checker.

The `Disallow: /` rule was picked up by Googlebot too; not only ia_archiver was blocked.

The obvious question is:

What is wrong with this robots.txt?

Is the order of the entries the culprit?

  • See: https://www.robotstxt.org/robotstxt.html – Ouroborus May 07 '23 at 05:58
  • This does not solve the issue. – Avatar May 07 '23 at 05:59
  • [Google](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt?hl=en) seems to say you did *not* properly set your file to allow their bot though. – OldPadawan May 07 '23 at 06:02
  • 1
    Sure it does. In its examples, it shows that you need at least one `Disallow` for each `User-agent`. It also tells you how to set it up so that google is allowed but other bots are not allowed. – Ouroborus May 07 '23 at 06:02
  • In your link there is an entry "To allow a single robot", but not the case I described in the question. Please read the question again. – Avatar May 07 '23 at 06:03
  • 1
    It's exactly the case you describe. – Ouroborus May 07 '23 at 06:04
  • I want to allow all bots, but not the Wayback Machine. – Avatar May 07 '23 at 06:05
  • 1
    So add an empty `Disallow` to your `User-agent: *`. Again, you need at least one `Disallow` for each `User-agent`. – Ouroborus May 07 '23 at 06:05
  • Having `User-agent: * Disallow:` followed by `User-agent: ia_archiver Disallow: /` – does this still block the ia_archiver? – Avatar May 07 '23 at 06:08
  • That's what the docs say should happen. – Ouroborus May 07 '23 at 06:11
  • 1
    Oh, you also probably need the `*` entry to be the last `User-agent`. It's not part of the docs, but implementation has been left up to developers so, often, the first match is the one that's obeyed. – Ouroborus May 07 '23 at 06:13

1 Answer


The solution:

```
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: 

Sitemap: https://www.example.com/sitemap.xml
```

The `ia_archiver` group must come first.

The empty `Disallow:` allows all other user agents to crawl the site.
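You can sanity-check the corrected file locally with Python's standard-library `urllib.robotparser` before deploying it (a minimal sketch; the URL is a placeholder):

```python
from urllib import robotparser

# The corrected robots.txt from this answer.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# ia_archiver is blocked; every other crawler is allowed.
print(rp.can_fetch("ia_archiver", "https://www.example.com/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/page.html"))    # True
```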

  • 1
    The order of the `User-agent` lines doesn't matter if the crawler is implemented according to the spec. 3.2.1 of the robots.txt standard says that user agents have to first look for `User-agent` lines that match their name, and then fall back to `*`. – Stephen Ostermiller May 07 '23 at 10:43
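As a quick check of that claim, Python's `urllib.robotparser` follows the spec here: it tries groups matching the crawler's name before falling back to `*`, so ia_archiver stays blocked even with the wildcard group first (a sketch with a placeholder URL):

```python
from urllib import robotparser

# Same rules as the answer, but with the wildcard group first.
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: ia_archiver
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A spec-compliant parser consults the group matching the crawler's
# name before the * fallback, so group order does not matter.
print(rp.can_fetch("ia_archiver", "https://www.example.com/"))  # False
print(rp.can_fetch("Googlebot", "https://www.example.com/"))    # True
```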