2

I want to use R for some statistical analysis of logfile information, but found that even the "limited" R-core RPM has a lot of dependencies not already installed. I don't want to install so many packages for a peripheral need.

Are there lightweight alternatives for simple statistical analysis on RHEL 6? I have an R script that accepts on stdin a large set of values -- one value per line -- and prints out the min, max, mean, median, 95th percentile, and standard deviation.

For more context, I'm using grep and awk to find GET requests for a particular path in our webserver log files, extract the response times, and calculate the metrics listed above to measure the performance impact of changes to a web application.

I don't need any graphing capabilities, just simple computation. Is there something I've overlooked?

Eric Rath

3 Answers

2

Here's min, max, total, mean, and median in awk:

BEGIN {
    min="unset"
    max=0
}


{
    values[NR] = $1

    total += $1
    average = total/NR

    if ($1 > max) max = $1
    if (min == "unset" || $1 < min) min = $1
}

END {
    # NOTE: values are stored in input order, so pipe the input through
    # "sort -n" first for this middle element to be the true median
    median = values[int(NR/2)]

    print "MIN:", min
    print "MAX:", max
    print "TOTAL:", total
    print "MEAN:", average
    print "MEDIAN:", median
}

Standard deviation and the 95th percentile are left as an exercise for the reader.
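A possible sketch of that exercise (not part of the original answer): a population standard deviation and a nearest-rank 95th percentile, with the sorting delegated to `sort -n` so any POSIX awk suffices:

```shell
# Sample input: one value per line; replace the printf with the
# grep/awk pipeline from the question for real log data.
printf '1\n2\n3\n4\n5\n' | sort -n | awk '
{
    values[NR] = $1
    total += $1
}
END {
    mean = total / NR
    for (i = 1; i <= NR; i++)
        sumsq += (values[i] - mean) ^ 2
    print "STDDEV:", sqrt(sumsq / NR)       # population standard deviation
    print "95TH:", values[int(NR * 0.95)]   # nearest-rank approximation
}'
```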

larsks
1

Any programming/scripting language like Perl, Python, or Ruby will do this easily, and `bc` is also available.

Sven
    It would be more helpful to provide [an example](http://serverfault.com/a/371285/93109) in any or all of those languages. – aculich Mar 19 '12 at 20:15
  • @aculich: Sorry, but that's what Google (or [SO]) is for. The fact that I know that it's easily done (and also is well documented) doesn't mean that I do this every day and can construct an example in a time frame acceptable for this type of question. Also: I fully answered the question, as an example was never asked. – Sven Mar 19 '12 at 20:21
1

Use the Python NumPy package, which you should be able to install easily with `yum install numpy` or `pip install numpy`:

import numpy
n = numpy.random.rand(100)
print 'min:', n.min()
print 'max:', n.max()
print 'mean:', n.mean()
print 'median:', numpy.median(n)
print '95th:', numpy.percentile(n, 95)
print 'stddev:', n.std()

This will save you from re-implementing basic statistics from scratch. More generally, Python with NumPy and SciPy is a feature-rich alternative to R, and for this kind of numeric work its vectorized operations often perform at least as well.
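Adapting that idea to the question's pipeline (one value per line on stdin) might look like the following sketch; the script name and the `summarize` helper are illustrative, not part of the original answer:

```python
import sys
import numpy

def summarize(values):
    # Compute the metrics the question asks for from a sequence of numbers.
    data = numpy.asarray(values, dtype=float)
    return {
        'min': data.min(),
        'max': data.max(),
        'mean': data.mean(),
        'median': numpy.median(data),
        '95th': numpy.percentile(data, 95),
        'stddev': data.std(),  # population standard deviation
    }

if __name__ == '__main__':
    # e.g.: grep 'GET /some/path' access.log | awk '{print $NF}' | python stats.py
    values = [float(line) for line in sys.stdin if line.strip()]
    for name, value in sorted(summarize(values).items()):
        print('%s: %g' % (name, value))
```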

Also, rather than writing your own log file parsing with grep and awk, you can use something like pylogsparser, which is "a log parser library packaged with a set of ready to use parsers (DHCPd, Squid, Apache, ...)".

aculich
  • Well, that's why I suggested `python` as my first choice in the comment above :). The `awk` implementation was really just for the fun of it. – larsks Mar 19 '12 at 20:11
  • hehe... don't you mean for the *pain* of it!? `awk`, for the masochist in you! :P – aculich Mar 19 '12 at 20:13
  • NumPy did the trick. I'm not familiar with Python, and was unable to get `numpy.percentile(n, 95)` working. Instead, I used the stats package from SciPy for the `scoreatpercentile(n, 95)` method to get the 95th percentile value. – Eric Rath Apr 19 '12 at 18:20