2

I want to use R for some statistical analysis of logfile information, but found that even the "limited" R-core RPM has a lot of dependencies not already installed. I don't want to install so many packages for a peripheral need.

Are there lightweight alternatives for simple statistical analysis on RHEL 6? I have an R script that accepts on stdin a large set of values -- one value per line -- and prints out the min, max, mean, median, 95th percentile, and standard deviation.

For more context, I'm using grep and awk to find GET requests for a particular path in our webserver log files, extract the response times, and calculate the metrics listed above to measure the performance impact of changes to a web application.

I don't need any graphing capabilities, just simple computation. Is there something I've overlooked?

Eric Rath

3 Answers

2

Here's min, max, total, mean, and median in awk:

BEGIN {
    min="unset"
    max=0
}


{
    values[NR] = $1

    total += $1
    average = total/NR

    if ($1 > max) max = $1
    if (min == "unset" || $1 < min) min = $1
}

END {
    # NOTE: values are stored in input order, so pipe the input through
    # "sort -n" first for this middle element to be the true median
    median = values[int(NR/2)]

    print "MIN:", min
    print "MAX:", max
    print "TOTAL:", total
    print "MEAN:", average
    print "MEDIAN:", median
}

Standard deviation and the 95th percentile are left as an exercise for the reader.
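A possible sketch of that exercise (not part of the original answer): a population standard deviation and a nearest-rank 95th percentile, with the sorting delegated to `sort -n` so any POSIX awk suffices:

```shell
# Sample input: one value per line; replace the printf with the
# grep/awk pipeline from the question for real log data.
printf '1\n2\n3\n4\n5\n' | sort -n | awk '
{
    values[NR] = $1
    total += $1
}
END {
    mean = total / NR
    for (i = 1; i <= NR; i++)
        sumsq += (values[i] - mean) ^ 2
    print "STDDEV:", sqrt(sumsq / NR)       # population standard deviation
    print "95TH:", values[int(NR * 0.95)]   # nearest-rank approximation
}'
```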

larsks
1

Any programming/scripting language like Perl, Python, or Ruby will do this easily, and `bc` is also available.

Sven
    It would be more helpful to provide [an example](http://serverfault.com/a/371285/93109) in any or all of those languages. – aculich Mar 19 '12 at 20:15
  • @aculich: Sorry, but that's what Google (or [SO]) is for. The fact that I know that it's easily done (and also is well documented) doesn't mean that I do this every day and can construct an example in a time frame acceptable for this type of question. Also: I fully answered the question, as an example was never asked. – Sven Mar 19 '12 at 20:21
1

Use the Python NumPy package, which you should be able to install easily with `yum install numpy` or `pip install numpy`:

import numpy
n = numpy.random.rand(100)
print 'min:', n.min()
print 'max:', n.max()
print 'mean:', n.mean()
print 'median:', numpy.median(n)
print '95th:', numpy.percentile(n, 95)
print 'stddev:', n.std()

This will save you from re-implementing basic statistics from scratch. More generally, Python with NumPy and SciPy is a feature-rich alternative to R, and for this kind of numeric work its vectorized operations often perform at least as well.
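Adapting that idea to the question's pipeline (one value per line on stdin) might look like the following sketch; the script name and the `summarize` helper are illustrative, not part of the original answer:

```python
import sys
import numpy

def summarize(values):
    # Compute the metrics the question asks for from a sequence of numbers.
    data = numpy.asarray(values, dtype=float)
    return {
        'min': data.min(),
        'max': data.max(),
        'mean': data.mean(),
        'median': numpy.median(data),
        '95th': numpy.percentile(data, 95),
        'stddev': data.std(),  # population standard deviation
    }

if __name__ == '__main__':
    # e.g.: grep 'GET /some/path' access.log | awk '{print $NF}' | python stats.py
    values = [float(line) for line in sys.stdin if line.strip()]
    for name, value in sorted(summarize(values).items()):
        print('%s: %g' % (name, value))
```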

Also, rather than writing your own log file parsing with grep and awk, you can use something like pylogsparser, which is "a log parser library packaged with a set of ready to use parsers (DHCPd, Squid, Apache, ...)".

aculich
  • Well, that's why I suggested `python` as my first choice in the comment above :). The `awk` implementation was really just for the fun of it. – larsks Mar 19 '12 at 20:11
  • hehe... don't you mean for the *pain* of it!? `awk`, for the masochist in you! :P – aculich Mar 19 '12 at 20:13
  • NumPy did the trick. I'm not familiar with Python, and was unable to get `numpy.percentile(n, 95)` working. Instead, I used the stats package from SciPy for the `scoreatpercentile(n, 95)` method to get the 95th percentile value. – Eric Rath Apr 19 '12 at 18:20