While trying to figure out whether AWK or MySQL is more efficient for processing log files and returning aggregate stats, I noticed the following behavior, which doesn't make sense to me:
To test this I used a file with 4 columns and approximately 9 million records. Both tests ran on the same server, a VPS with an SSD and 1 GB of RAM.
column1 has about 10 unique values, and the combination of all four columns has approximately 4k unique values.
In MySQL I use a table defined as (column1, column2, column3, column4) with no indices.
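For reference, a minimal sketch of how such a table could be created and loaded; the column types and the file path here are assumptions, not my actual definitions:

-- assumed schema; the real column types may differ
CREATE TABLE log_test (
    column1 VARCHAR(64),
    column2 VARCHAR(64),
    column3 VARCHAR(64),
    column4 VARCHAR(64)
);

-- assumed load command; the path is a placeholder
LOAD DATA LOCAL INFILE '/path/to/logfile.csv'
INTO TABLE log_test
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(column1, column2, column3, column4);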
Data Format:
column1,column2,column3,column4
column1,column2,column3,column4
AWK Script:
BEGIN {
    FS = ",";
    time = systime();
}
{
    array[$1]++;                            # first test
    # array[$1 "," $2 "," $3 "," $4]++;     # second test
}
END {
    for (value in array) {
        print "array[" value "]=" array[value];
    }
}
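I invoke the script against the log file roughly like this (the script and file names are placeholders):

awk -f aggregate.awk logfile.csv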
MySQL Queries:
Query 1: SELECT column1, count(*) FROM log_test GROUP BY column1;
Query 2: SELECT column1, column2, column3, column4, count(*)
FROM log_test GROUP BY column1, column2, column3, column4;
AWK is slower than MySQL, as expected. For the first test, which returns aggregate data with 10 output lines, MySQL takes around 7 seconds to finish and AWK takes around 22 seconds.
I understand that AWK reads the file line by line and processes the data, so I would expect the second test, which outputs 4k lines, to take about the same time as the first: it still reads the same number of lines and isn't doing much more processing. However, AWK takes about 90 seconds while using only 0.1% of memory, whereas MySQL takes about 45 seconds and uses 3% of memory.
- Why does AWK take so much longer for test 2 than test 1 when it is essentially reading the same file?
- Why does AWK not use more memory? Is it storing its values on the hard disk instead of in memory?
- Why is MySQL so much faster when it essentially has to read the table line by line as well?
- Are there more efficient alternatives to aggregating this data?