I'm not sure whether this is the right forum for my question. I'm analyzing web server logs in both Apache and IIS formats, and I want to find evidence of automated browsing (e.g. web robots, spiders, bots). I used the Python package robot-detection 0.2.8 to detect robots in my log files. However, there may be other robots (automated programs) that have traversed the site but that robot-detection cannot identify.

  1. Are there any specific clues in the log files, such as actions that software performs but human users do not?
  2. Do robots follow a specific navigation pattern?
  3. I saw some requests for favicon.ico. Does this indicate automated browsing?

I found this article with some valuable points.

Nilani Algiriyage

1 Answer

The article on how to identify robots has some good information. Here are some other things you might consider:

  • If you see a request for an HTML page, but it isn't followed by requests for the images or script files that the page uses, it's very likely that the request came from a crawler. If you see lots of those from the same IP address, it's almost certainly a crawler. It could be the Lynx browser (text only), but it's more likely a crawler.
  • It's pretty easy to spot a crawler that scans your entire site very quickly. But some crawlers go more slowly, waiting 5 minutes or more between page requests. If you see multiple requests from the same IP address, spread out over time but at very regular intervals, it's probably a crawler.
  • Repeated 403 (Forbidden) entries in the log from the same IP. It's rare that a human will suffer through more than a handful of 403 errors before giving up. An unsophisticated crawler will blindly try URLs on the site, even if it gets dozens of 403s.
  • Repeated 404s from the same IP address. Again, a human will give up after a small number of 404s. A crawler will blindly push on ... "I know there's a good URL in here somewhere."
  • A user-agent string that isn't one of the major browsers' agent strings. If the user-agent string doesn't look like a browser's user agent string, it's probably a bot. Note that the reverse isn't true; many bots set the user agent string to a known browser user agent string.
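
Several of the heuristics above can be checked mechanically. What follows is only a rough sketch, assuming Apache combined log format (IIS W3C logs would need a different parser). The function names, the list of asset extensions, and all thresholds (`min_pages`, `max_errors`, `max_jitter`) are made up for illustration and would need tuning against real traffic.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

# Assumes Apache combined log format; adjust the pattern for other formats.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
# Hypothetical list of "page asset" extensions; extend as needed.
ASSET_EXT = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")


def parse(lines):
    """Yield (ip, timestamp, path, status) for each line that matches."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
            yield m["ip"], ts, m["path"], int(m["status"])


def suspicious_ips(lines, min_pages=5, max_errors=10, max_jitter=0.1):
    """Return {ip: [reasons]} for IPs that trip at least one heuristic."""
    pages = defaultdict(list)   # ip -> timestamps of non-asset requests
    assets = defaultdict(int)   # ip -> count of image/script/style requests
    errors = defaultdict(int)   # ip -> count of 403/404 responses
    for ip, ts, path, status in parse(lines):
        if path.split("?")[0].lower().endswith(ASSET_EXT):
            assets[ip] += 1
        else:
            pages[ip].append(ts)
        if status in (403, 404):
            errors[ip] += 1

    flagged = {}
    for ip in set(pages) | set(errors):
        reasons = []
        times = sorted(pages[ip])
        # Pages fetched, but never the images or scripts they use.
        if len(times) >= min_pages and assets[ip] == 0:
            reasons.append("no_assets")
        # Keeps going despite many 403/404 responses.
        if errors[ip] >= max_errors:
            reasons.append("many_errors")
        # Requests spaced at very regular intervals (low relative spread).
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        if len(gaps) >= min_pages - 1 and mean(gaps) > 0:
            if pstdev(gaps) / mean(gaps) <= max_jitter:
                reasons.append("regular_intervals")
        if reasons:
            flagged[ip] = reasons
    return flagged
```

You could call it as `suspicious_ips(open("access.log"))`, where `access.log` is a placeholder file name. Treat the result as a list of candidates to inspect by hand, not as a verdict: several users behind one proxy or NAT share an IP address, and a text-only browser will also trip the `no_assets` check.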
Jim Mischel