
I'll begin by telling you what we do.

The measures we have implemented catch a lot of spiders, but we have no idea how many we are missing. Currently we apply a set of measures that clearly overlap in part (a rough sketch of how they might fit together appears below):

  1. monitor requests for our robots.txt file, then filter all other requests from the same IP address and user agent

  2. compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose

  3. pattern analysis: we don't have pre-set thresholds for these metrics, but we still find them useful. We look at (i) page views as a function of time (e.g., clicking a lot of links with 200 msec spent on each page is probative); (ii) the path by which the 'user' traverses our site: is it systematic and complete, or nearly so (like following a back-tracking algorithm)?; and (iii) precisely timed visits (e.g., 3 a.m. each day).

Again, I am fairly sure we're only getting the low-hanging fruit, but I'm interested in hearing views from the community.
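For concreteness, here is a rough sketch of how measures like these could be wired together over an access log. The log format, the user-agent substrings (standing in for the iab.net / user-agents.org lists), and the timing threshold are all illustrative assumptions rather than our production values, and the pattern analysis covers only the timing check from (i):

    import re
    from collections import defaultdict
    from datetime import datetime

    # Combined-log-format parser; the field layout is an assumption about the log source.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    # Measure 2 stand-in: a few substrings in place of the published bot lists.
    KNOWN_BOT_UA_SUBSTRINGS = ("googlebot", "bingbot", "crawler", "spider", "curl")

    MIN_HUMAN_INTERVAL_SEC = 1.0  # assumed "too fast to be a person" threshold


    def parse_line(line):
        m = LOG_PATTERN.match(line)
        if m is None:
            return None
        rec = m.groupdict()
        rec["when"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
        return rec


    def suspected_spiders(log_lines):
        robots_fetchers = set()             # measure 1: clients that asked for robots.txt
        hits_by_client = defaultdict(list)  # measure 3: request times per (IP, user agent)

        for line in log_lines:
            rec = parse_line(line)
            if rec is None:
                continue
            client = (rec["ip"], rec["ua"])
            if rec["path"].split("?")[0] == "/robots.txt":
                robots_fetchers.add(client)
            hits_by_client[client].append(rec["when"])

        flagged = set()
        for client, times in hits_by_client.items():
            ua = client[1]
            # Measure 1: flag everything else from a client that fetched robots.txt.
            if client in robots_fetchers:
                flagged.add(client)
            # Measure 2: user agent matches the published-list stand-in.
            if any(s in ua.lower() for s in KNOWN_BOT_UA_SUBSTRINGS):
                flagged.add(client)
            # Measure 3 (timing part only): sustained sub-second gaps between page views.
            times.sort()
            gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
            if len(gaps) >= 5 and sum(gaps) / len(gaps) < MIN_HUMAN_INTERVAL_SEC:
                flagged.add(client)

        return flagged


    if __name__ == "__main__":
        import sys
        for ip, ua in sorted(suspected_spiders(sys.stdin)):
            print(f"{ip}\t{ua}")

Fed an access log on stdin, this prints the IP / user-agent pairs that tripped at least one of the three checks; the path-traversal and visit-schedule parts of measure 3 would need per-client session reconstruction on top of this.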


1 Answer


The newsletter posts tagged "Web Log Analysis" on the site of Nihuo, a commercial web log analyzer, could be useful reading.
