My specific problem is that I have a set of Apache access logs, and I want to extract from them a “rolled up” count of requests, grouped into time windows of a specified size.
Example of my data:
127.0.0.1 - - [01/Dec/2011:00:00:11 -0500] "GET / HTTP/1.0" 304 266 "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
127.0.0.1 - - [01/Dec/2011:00:00:24 -0500] "GET /feed/rss2/ HTTP/1.0" 301 447 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:25 -0500] "GET /feed/ HTTP/1.0" 304 189 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET /robots.txt HTTP/1.0" 200 333 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET / HTTP/1.0" 200 10011 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"
As you can see, each line represents an event (in this case, an HTTP request) and contains a timestamp.
Assuming my data covers 3 days, and I specify a time window size of 1 day, I’d like to generate something like this:
Start End Count
2011-12-01 05:00 2011-12-02 05:00 2822
2011-12-02 05:00 2011-12-03 05:00 2572
2011-12-03 05:00 2011-12-04 05:00 604
But I need to be able to vary the size of the window: I might want to analyze a given dataset using windows of 5 minutes, 10 minutes, 1 hour, 1 day, 1 week, and so on.
I also need the library/tool to be capable of analyzing a dataset (a series of lines) of hundreds or even thousands of megabytes in size.
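To make the requirement concrete, the rollup I have in mind could be sketched roughly as follows in Python. This is only an illustration under my own assumptions, not an existing tool: the names `window_start`, `rollup`, and `format_rows` are made up for this sketch, windows are assumed to be aligned to the log's own UTC offset (which is what puts the daily boundaries at 05:00 in the sample output above), and lines without a recognizable timestamp are silently skipped.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Matches the bracketed timestamp of the common/combined log format.
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def window_start(line, window_seconds):
    """Return the UTC start of the window containing this line, or None."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        return None
    # %b relies on English month names (true in the default C locale).
    stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    offset = int(stamp.utcoffset().total_seconds())
    # Assumption: windows are aligned to the log's own local clock,
    # so floor the local wall-clock seconds, then convert back to UTC.
    local_seconds = int(stamp.timestamp()) + offset
    start = local_seconds - local_seconds % window_seconds - offset
    return datetime.fromtimestamp(start, tz=timezone.utc)


def rollup(lines, window_seconds):
    """Count lines per window, consuming the input one line at a time."""
    counts = Counter()
    for line in lines:
        start = window_start(line, window_seconds)
        if start is not None:
            counts[start] += 1
    return sorted(counts.items())


def format_rows(rows, window_seconds):
    """Render (start, count) pairs as 'Start End Count' text rows."""
    size = timedelta(seconds=window_seconds)
    return [
        f"{start:%Y-%m-%d %H:%M} {start + size:%Y-%m-%d %H:%M} {count}"
        for start, count in rows
    ]
```

Passing `sys.stdin` as `lines` would let this read piped log data, and because only one counter per window is kept, memory use should grow with the number of windows rather than with the size of the input. Changing the window is just a matter of passing a different number of seconds (300 for 5 minutes, 86400 for 1 day).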
A prebuilt tool that accepts the data via standard input would be great, but a library would be totally fine, as I could just build the tool around it. Any language would be fine; if I don’t know it, I can learn it.
I’d prefer to do this by piping the access log data directly into a tool/library with minimal dependencies — I’m not looking for suggestions to store the data in a database and then query the database to do the analysis. If I need to, I can figure that out myself.
I tried Splunk and found it way too heavyweight and complex for my case. It’s not just a tool, it’s a whole system with its own datastore, complex indexing and querying abilities, etc.
My question is: does such a library and/or tool exist?
Full disclosure
I must admit, I actually tried and failed to find something like this a few months ago, so I wrote my own. For some reason I didn’t think to post this question at that time. I will share the lib/tool I wrote in an answer shortly. But I really am curious if something like this does exist; maybe I just missed it when I was searching a few months ago.