
Goal

I wish to use RRDTool to count logical "user activity" from our web application's Apache/Tomcat access logs.

Specifically we want to count, for a period, occurrences of several url patterns.

Example

We have two applications (call them 'foo' and 'bar')

These URLs interest us. They indicate when users 'did interesting stuff'.

/foo/hop
/foo/skip
/foo/jump

/bar/crawl
/bar/walk
/bar/run

Basically, we want to know for a given interval (10 minutes, hour, day, etc.) how many users hopped, skipped, jumped, crawled, walked, etc.
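As a rough illustration of the counting described above, here is a minimal Python sketch that buckets matching requests into fixed intervals. The names `PATTERNS`, `LINE_RE`, and `count_by_interval` are hypothetical, and the regular expression assumes an Apache common-log-style line with a `[dd/Mon/yyyy:HH:MM:SS +zzzz]` timestamp; the second server's log format would need its own pattern.

```python
import re
from collections import Counter
from datetime import datetime

# URLs of interest, taken from the example above.
PATTERNS = {"/foo/hop", "/foo/skip", "/foo/jump",
            "/bar/crawl", "/bar/walk", "/bar/run"}

# Assumption: common-log-style lines, e.g.
# 1.2.3.4 - - [13/Mar/2015:14:00:05 +0000] "GET /foo/hop HTTP/1.1" 200 12
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?:[A-Z]+) (?P<path>\S+)')


def count_by_interval(lines, interval_seconds=600):
    """Return a Counter keyed by (interval_start_epoch, url)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not a recognizable access-log line
        path = match.group("path").split("?", 1)[0]  # drop any query string
        if path not in PATTERNS:
            continue
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        epoch = int(stamp.timestamp())
        interval_start = epoch - epoch % interval_seconds
        counts[(interval_start, path)] += 1
    return counts
```

Calling `count_by_interval(open("access.log"))` (the file name is a placeholder) would yield one count per interval and URL, which is the 'intermediate results' form mentioned below.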

Reference/Starting point

This article on importing access logs into RRDTool seemed like a helpful starting point. http://neidetcher.com/programming/2014/05/13/just-enough-rrdtool.html

However, to clarify: this example uses the access log directly, whereas we want to put a handful of URLs 'in buckets' and count the 'number in each bucket'.

Some Scripting Required..

I could do this with bash, grep, and wc, iterating through the patterns and sending output to an 'intermediate results' text file, but I believe RRDTool could do this with minimal 'outside coding'. I am unclear on the details, though.

Some points

  • I mention 'two applications' because we actually serve them up from separate servers with different log file formats. I'd like to get them into the same RRD file.
  • Eventually I'd like to report this in cacti; initially however, I wanted to understand RRDTool details

  • I'm open to doing any coding, but would like to keep it as efficient as possible, both administratively and in computer resources. (By administratively, I mean: easy to monitor new instances.)

  • I am very new to RRDTool and am RTM'ing (and walking through the tutorial). I'm used to relational databases, spreadsheets, etc., and don't have my mind around all the nuances of the RRA format.

Thanks in advance!

user331465

1 Answer


You could set up a separate RRD file with an ABSOLUTE type data source for each address you want to track.
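A minimal sketch of what that setup could look like, assuming one file per URL. The helper names `rrd_name` and `create_command`, the data source name `hits`, the 60-second step, and the two RRA definitions are illustrative choices, not something prescribed here. The sketch only builds the `rrdtool create` argument lists; running them (for example with `subprocess.run`) is left to the caller.

```python
# URLs to track; each one gets its own RRD file.
URLS = ["/foo/hop", "/foo/skip", "/foo/jump",
        "/bar/crawl", "/bar/walk", "/bar/run"]


def rrd_name(url):
    """Hypothetical naming scheme: '/foo/hop' -> 'url-foo-hop.rrd'."""
    return "url-" + url.strip("/").replace("/", "-") + ".rrd"


def create_command(url, step=60):
    """Build (but do not run) an rrdtool create command for one URL."""
    return [
        "rrdtool", "create", rrd_name(url),
        "--step", str(step),
        # DS:<name>:<type>:<heartbeat>:<min>:<max>
        f"DS:hits:ABSOLUTE:{2 * step}:0:U",
        # RRA:<cf>:<xff>:<steps>:<rows> (sizes are assumptions)
        "RRA:AVERAGE:0.5:1:1440",   # 1-step resolution, 1440 rows
        "RRA:AVERAGE:0.5:10:1008",  # 10-step averages, 1008 rows
    ]


if __name__ == "__main__":
    for url in URLS:
        print(" ".join(create_command(url)))
```

Note that RRDtool stores ABSOLUTE data as a per-second rate, so a hit count for an interval is recovered by multiplying the stored rate by the interval length in seconds.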

Then you tail the log file and whenever you see one of the interesting urls rush by you call:

rrdtool update url-xyz.rrd N:1

The ABSOLUTE data source type is like a counter, but it gets reset every time it is read. Your counter will just count to one, but that should not be a problem.

In the example above I am using N: and not the timestamp from the access log. You could use the log timestamp instead if you are not doing this in real time, but beware that you cannot update the same RRD file twice with the same timestamp. N: uses millisecond timestamps internally and thus will probably avoid this problem.
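If you do want the access-log time rather than N:, the timestamp has to be converted to seconds since the epoch first. A small sketch, assuming the Apache-style `dd/Mon/yyyy:HH:MM:SS +zzzz` timestamp format; the function name `update_command` is hypothetical, and the command is only built, not executed.

```python
from datetime import datetime


def update_command(rrd_file, log_timestamp, count=1):
    """Build an rrdtool update command using the log's own timestamp.

    Assumes an Apache-style timestamp such as '13/Mar/2015:14:29:00 +0000'.
    """
    stamp = datetime.strptime(log_timestamp, "%d/%b/%Y:%H:%M:%S %z")
    epoch = int(stamp.timestamp())  # seconds since 1970-01-01 UTC
    return ["rrdtool", "update", rrd_file, f"{epoch}:{count}"]
```

Because access-log timestamps only have one-second resolution, two hits in the same second would produce two updates with the same timestamp, which is exactly the case to watch out for.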

On the other hand it may make more sense to accumulate matching log entries with the same timestamp and only update rrdtool with that number once the timestamp on the logfile changes.
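The accumulation idea could be sketched like this. It assumes the log has already been parsed into `(epoch_seconds, rrd_file)` pairs in non-decreasing time order; the function name `batched_updates` is made up for illustration. One update per file is emitted each time the timestamp changes, carrying the number of hits seen in that second.

```python
def batched_updates(events):
    """Yield one rrdtool update command per (timestamp, file).

    events: iterable of (epoch_seconds, rrd_file) pairs, assumed to
    arrive in log order with non-decreasing timestamps.
    """
    current_ts = None
    counts = {}

    def flush():
        # Emit the accumulated counts for the timestamp just finished.
        for name, hits in sorted(counts.items()):
            yield ["rrdtool", "update", name, f"{current_ts}:{hits}"]

    for ts, rrd_file in events:
        if current_ts is not None and ts != current_ts:
            yield from flush()
            counts = {}
        current_ts = ts
        counts[rrd_file] = counts.get(rrd_file, 0) + 1

    # End of input: flush whatever is still pending.
    if current_ts is not None:
        yield from flush()
```

When following a live log, the final flush would only happen once a later timestamp shows up; for a finished log file, flushing at end of input is enough. Each file then receives at most one update per timestamp.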

Benjamin W.
Tobi Oetiker
  • thanks. to clarify: 1) are you saying "use a separate .rrd file per url"? i.e. vs one rrd file with one data source per url? 2) I want to use the 'access log time' (versus "N" or current time). Can I use access log time directly? (I'm unclear as to whether I need to summarize per time-increment in advance or whether rrdtool can use 'access time' and summarize for free, i.e. via the RRA) rrdtool update url-foo.rrd <>:1 – user331465 Mar 13 '15 at 14:29
  • OK. As a follow-on, and at the risk of sounding very dense: it sounds like a) I should summarize by interval in a script, and b) rrdtool cannot record the 'actual access log time' per occurrence, then sum it up by interval. I was hoping to 'record each entry' in the rrd database and use RRA/rrdtool similar to SQL's 'group by'. The only goal was to eliminate the 'intermediate massaging' layer between 'access log' and 'rrdtool'. – user331465 Mar 13 '15 at 21:02
  • rrdtool never records time as such ... have you worked through the tutorial ... rrdtool has not a sqlish bone in its body. – Tobi Oetiker Mar 14 '15 at 00:28