I'm trying to get accurate download numbers for some files on a web server. I look at the user agents: some are clearly bots or web crawlers, but for many I'm not sure whether they are crawlers or not, and they account for a lot of downloads, so it's important for me to know.

Is there a list somewhere of known web crawlers, with documentation like user agent, IPs, behavior, etc.?

I'm not interested in the official ones, like Google's, Yahoo's, or Microsoft's. Those are generally well behaved and self-identified.

– Pablo Fernandez

4 Answers

I usually use http://www.user-agents.org/ as a reference; hope this helps you out.

You can also try http://www.robotstxt.org/db.html or http://www.botsvsbrowsers.com.

– Jaan J

I'm maintaining a list of crawler user-agent patterns at https://github.com/monperrus/crawler-user-agents/.

It's collaborative; you can contribute to it with pull requests.
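
For reference, here's a minimal sketch (not from the repository itself) of how such a JSON pattern list could be applied to your logs. It assumes the file is an array of objects with a `pattern` field containing a regular-expression fragment, which matches the repository's format at the time of writing, but check its README for the current layout.

```typescript
import { readFileSync } from "fs";

// One entry per known crawler; `pattern` is a regular-expression fragment.
interface CrawlerEntry {
  pattern: string;
}

// Load a local copy of crawler-user-agents.json downloaded from the repo.
const entries: CrawlerEntry[] = JSON.parse(
  readFileSync("crawler-user-agents.json", "utf8")
);

// Compile each pattern once so per-request matching stays cheap.
const patterns = entries.map((e) => new RegExp(e.pattern));

function isKnownCrawler(userAgent: string): boolean {
  return patterns.some((re) => re.test(userAgent));
}

// Example usage against two user-agent strings:
console.log(isKnownCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)")); // true
console.log(isKnownCrawler("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0")); // likely false
```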

– Martin Monperrus

http://www.robotstxt.org/db.html is a good place to start. They have an automatable raw feed if you need that too. http://www.botsvsbrowsers.com/ is also helpful.
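
As a sketch of automating that feed: the snippet below assumes the export is a plain-text file with one `robot-useragent:` field line per record. The filename and the field layout here are assumptions, so verify them against the actual export before relying on this.

```typescript
import { readFileSync } from "fs";

// Pull the user-agent strings out of a downloaded copy of the raw feed,
// assuming a "field-name: value" line-oriented record format.
function extractUserAgents(rawFeed: string): string[] {
  return rawFeed
    .split("\n")
    .filter((line) => line.startsWith("robot-useragent:"))
    .map((line) => line.slice("robot-useragent:".length).trim())
    .filter((ua) => ua.length > 0);
}

const feed = readFileSync("robots-db.txt", "utf8"); // hypothetical local copy
console.log(extractUserAgents(feed).slice(0, 5));
```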

– Justin Grant

Unfortunately, we've found that bot activity is too numerous and varied to filter accurately. If you want accurate download counts, your best bet is to require JavaScript to trigger the download; that's basically the only thing that will reliably filter out the bots. It's also why all site traffic analytics engines these days are JavaScript-based.
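
As an illustration of that approach, here is a minimal client-side sketch: the download only gets counted when script actually runs, so bots that don't execute JavaScript never register. The `/count-download` endpoint and the `data-download` attribute are made-up placeholders, not anything from this answer.

```typescript
function trackedDownload(fileUrl: string): void {
  // Record the download on the server; clients that don't run script
  // never reach this call.
  void fetch("/count-download?file=" + encodeURIComponent(fileUrl), {
    method: "POST",
  });
  // Then start the actual download.
  window.location.href = fileUrl;
}

// Intercept clicks on links marked for tracking; the plain href still
// works as a fallback for script-less clients (it just isn't counted).
document.querySelectorAll<HTMLAnchorElement>("a[data-download]").forEach((a) => {
  a.addEventListener("click", (ev) => {
    ev.preventDefault();
    trackedDownload(a.href);
  });
});
```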

– jwanagel
  • The problem in our case is that we have many valid downloaders that won't run JavaScript, like iTunes or any other podcatcher. – Pablo Fernandez Nov 14 '09 at 07:57
  • Unfortunately you're really out of luck then as far as highly accurate download counts go. The best alternative I can recommend is looking at three numbers: total downloads (no filtering), a count excluding known bots (blacklist filtering), and a count including only known-good agents (whitelist filtering). That will at least give you something to look at for trends and rough ballpark estimates (see the sketch after these comments). – jwanagel Nov 14 '09 at 09:01
  • Sorry, but requiring JavaScript will filter out legitimate users too. At the same time, the number of websites that require JavaScript to show any content at all incentivizes bots to run JavaScript. – Yakov Galka Oct 30 '20 at 19:46
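
A rough sketch of the three numbers suggested in the comments, computed over user-agent strings pulled from an access log. The blacklist and whitelist patterns below are illustrative stand-ins, not real lists:

```typescript
// Illustrative pattern lists; replace with a maintained blacklist/whitelist.
const botBlacklist = [/bot/i, /crawler/i, /spider/i];
const knownGoodWhitelist = [/iTunes/i, /Firefox\//, /Chrome\//];

function summarize(userAgents: string[]) {
  // Total downloads, no filtering.
  const total = userAgents.length;
  // Exclude anything matching the bot blacklist.
  const excludingBots = userAgents.filter(
    (ua) => !botBlacklist.some((re) => re.test(ua))
  ).length;
  // Include only agents matching the known-good whitelist.
  const knownGoodOnly = userAgents.filter((ua) =>
    knownGoodWhitelist.some((re) => re.test(ua))
  ).length;
  // The true count lies somewhere between knownGoodOnly and excludingBots.
  return { total, excludingBots, knownGoodOnly };
}

console.log(summarize([
  "iTunes/12.0 (Macintosh; OS X 10.15)",
  "Mozilla/5.0 (compatible; Googlebot/2.1)",
  "SomeUnknownAgent/1.0",
]));
```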