
I'll begin by telling you what we do.

The measures we have implemented catch a lot of spiders, but we have no idea how many we are missing. Currently we apply a set of measures that clearly overlap in part (a rough sketch of how they might fit together appears below):

  1. monitor requests for our robots.txt file, then filter all other requests from the same IP address and user agent

  2. compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose

  3. pattern analysis: we don't have pre-set thresholds for these metrics, but we still find them useful. We look at (i) page views as a function of time (e.g., clicking a lot of links with 200 msec spent on each page is probative); (ii) the path by which the 'user' traverses our site: is it systematic and complete, or nearly so (like following a back-tracking algorithm)?; and (iii) precisely timed visits (e.g., 3 a.m. each day).

Again, I am fairly sure we're only getting the low-hanging fruit, but I'm interested in hearing views from the community.
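For concreteness, here is a rough sketch of how measures like these could be wired together over an access log. The log format, the user-agent substrings (standing in for the iab.net / user-agents.org lists), and the timing threshold are all illustrative assumptions rather than our production values, and the pattern analysis covers only the timing check from (i):

    import re
    from collections import defaultdict
    from datetime import datetime

    # Combined-log-format parser; the field layout is an assumption about the log source.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
    )

    # Measure 2 stand-in: a few substrings in place of the published bot lists.
    KNOWN_BOT_UA_SUBSTRINGS = ("googlebot", "bingbot", "crawler", "spider", "curl")

    MIN_HUMAN_INTERVAL_SEC = 1.0  # assumed "too fast to be a person" threshold


    def parse_line(line):
        m = LOG_PATTERN.match(line)
        if m is None:
            return None
        rec = m.groupdict()
        rec["when"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
        return rec


    def suspected_spiders(log_lines):
        robots_fetchers = set()             # measure 1: clients that asked for robots.txt
        hits_by_client = defaultdict(list)  # measure 3: request times per (IP, user agent)

        for line in log_lines:
            rec = parse_line(line)
            if rec is None:
                continue
            client = (rec["ip"], rec["ua"])
            if rec["path"].split("?")[0] == "/robots.txt":
                robots_fetchers.add(client)
            hits_by_client[client].append(rec["when"])

        flagged = set()
        for client, times in hits_by_client.items():
            ua = client[1]
            # Measure 1: flag everything else from a client that fetched robots.txt.
            if client in robots_fetchers:
                flagged.add(client)
            # Measure 2: user agent matches the published-list stand-in.
            if any(s in ua.lower() for s in KNOWN_BOT_UA_SUBSTRINGS):
                flagged.add(client)
            # Measure 3 (timing part only): sustained sub-second gaps between page views.
            times.sort()
            gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
            if len(gaps) >= 5 and sum(gaps) / len(gaps) < MIN_HUMAN_INTERVAL_SEC:
                flagged.add(client)

        return flagged


    if __name__ == "__main__":
        import sys
        for ip, ua in sorted(suspected_spiders(sys.stdin)):
            print(f"{ip}\t{ua}")

Fed an access log on stdin, this prints the IP / user-agent pairs that tripped at least one of the three checks; the path-traversal and visit-schedule parts of measure 3 would need per-client session reconstruction on top of this.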


1 Answer


The newsletter posts tagged "Web Log Analysis" on the site of Nihuo, a commercial web log analyzer, could be useful reading.
