I am writing a script that must run in a loop. On each iteration, different scripts pull variables from external files, and the last step compiles them. I am trying to maximize the speed at which this loop can run, so I am looking for the best programs for the job.

The rate-limiting step right now is searching through a file that has 2 columns and 4.5 million lines. Column 1 is a key and column 2 is the value I am extracting.

The two programs I am evaluating are awk and grep. Both commands, with their run times for finding the last value, are below.

time awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile

T

real    0m2.255s
user    0m2.237s
sys     0m0.018s

time grep "^15 " infile |cut -d " " -f 2

T

real    0m0.164s
user    0m0.127s
sys     0m0.037s

This brings me to my question: how does grep search? I understand that awk runs line by line and field by field, which is why it takes longer as the file gets longer and I have to search further into it.

Grep is clearly not searching line by line, or if it is, it does so in a very different manner than awk, considering the roughly 14x time difference.

(I have noticed that awk runs faster than grep for short files, and I have yet to find where they diverge, but at those sizes it doesn't matter nearly as much!)

I'd like to understand this so I can make good decisions for future program usage.

jeffpkamp
  • I don't see how this is a dup. The linked answer is about shell expansion of '*', not how grep works. In any case, grep is line oriented. – copper.hat Jul 01 '14 at 17:04
  • @jaypal This is not a duplication of that question. He was asking how Grep interpreted an argument. I'm asking how its search function works. – jeffpkamp Jul 01 '14 at 17:08
  • @jeffpkamp I know, if you look at the second answer, the first two paras might help you. If not, I will retract my vote. – jaypal singh Jul 01 '14 at 17:12
  • @jaypal I saw that and that's partly why I asked my question as his answer was quite confusing to me and partially wrong. I was hoping for a clearer answer. – jeffpkamp Jul 01 '14 at 17:16
  • @jeffpkamp There you go and you might want to edit your question as the title doesn't reflect the question you pose inside the body. – jaypal singh Jul 01 '14 at 17:17
  • If you are writing a shell loop just to parse text files, you almost certainly have the wrong approach. Also, wrt making good decisions, only worry about performance IF you have a specific problem, after considering all the other good software practices. – Ed Morton Jul 01 '14 at 17:54

1 Answer

The awk command you posted does far more than the grep+cut:

awk -v a=15 'BEGIN{B=10000000}$1==a{print $2;B=NR}NR>B{exit}' infile
grep "^15 " infile |cut -d " " -f 2

so a time difference is very understandable. Try this awk command, which IS equivalent to the grep+cut, and see what results you get so we can compare apples to apples:

awk '/^15 /{print $2}' infile

or even:

awk '$1==15{print $2}' infile
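
As a rough sanity check before timing anything, the sketch below confirms that the grep+cut pipeline and the two awk one-liners above select the same value. The sample file and its contents are made up for illustration (they are not the asker's data); the `150 U` line is there to show that neither the anchored regex nor the field comparison picks up keys that merely start with 15.

```shell
# Build a small, hypothetical two-column file (key, value) for illustration.
infile=$(mktemp)
printf '%s\n' '14 S' '15 T' '150 U' '16 V' > "$infile"

# grep selects whole lines by regex; cut then splits only the matching lines.
out_grep=$(grep '^15 ' "$infile" | cut -d ' ' -f 2)

# awk: regex match against the whole record, then print field 2.
out_awk_re=$(awk '/^15 /{print $2}' "$infile")

# awk: compare field 1 against 15, then print field 2.
out_awk_eq=$(awk '$1==15{print $2}' "$infile")

printf 'grep+cut:  %s\nawk regex: %s\nawk ==:    %s\n' \
    "$out_grep" "$out_awk_re" "$out_awk_eq"

rm -f "$infile"
```

On this sample all three should print the same single value, so any timing difference between them comes from how the work is done rather than from different results.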
Ed Morton
  • That gets it closer: awk time is now .5 seconds vs .164 for grep. Grep is still faster in this case, but now only about 3x. Does this mean regular expressions are evaluated differently than a "==" expression? – jeffpkamp Jul 01 '14 at 17:59
  • Yes, certainly. String comparison is very different from RE comparison, almost certainly a completely different section of code. Talking of which, try `awk '$1==15{print $2}'`. – Ed Morton Jul 01 '14 at 18:13
  • I had done that one; it takes 1.7 seconds, clearly slower. I ran `awk 'END{print}' infile` to see how fast it could read through the file. Looks like the fastest it can run is .2-.5 seconds. This is still slower than grep, though just barely. Comparing `grep "^450" infile` to `awk '/^450/' infile` gives .3 on average for grep and .6 on average for awk. I guess these are much more comparable, though grep still seems to be a bit faster for this type of search. – jeffpkamp Jul 01 '14 at 18:31
  • Did you try the tests a few times to eliminate caching impact? Of course. grep just matches an RE while awk splits each record into fields before even attempting a match. In the grep+cut solution, the cut is only being applied to the lines that matched the RE. As you've noticed though, with equivalent code you're comparing the blink of an eye to a slightly faster blink of an eye so the performance impact is negligible and all of the other good software practices should be given priority. – Ed Morton Jul 01 '14 at 18:37
  • Yeah, I ran the test 10 times in a row and averaged in my head. I figured the difference was due to overhead for all the extra stuff that awk was doing. You've been helpful as always, Ed. Thanks. – jeffpkamp Jul 01 '14 at 18:45
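
For anyone wanting to repeat the comparison discussed in these comments without averaging by hand, here is a minimal sketch. The helper name `run_n`, the generated test file, and its size are all assumptions made for illustration; substitute the real input file to get meaningful numbers.

```shell
# Hypothetical helper: run a command N times, discarding its output,
# so one-off effects such as a cold file cache matter less.
run_n() {
    n=$1; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" > /dev/null
        i=$((i + 1))
    done
}

# Generate a hypothetical test file: keys 1..100000, each with the value "T".
testfile=$(mktemp)
awk 'BEGIN { for (k = 1; k <= 100000; k++) print k, "T" }' > "$testfile"

# Sanity check before timing: both commands should select the same lines.
grep_out=$(grep '^450' "$testfile")
awk_out=$(awk '/^450/' "$testfile")
[ "$grep_out" = "$awk_out" ] && echo "outputs match"

# Repeated runs of each command; wrap these calls with a timer to compare.
run_n 10 grep '^450' "$testfile"
run_n 10 awk '/^450/' "$testfile"

rm -f "$testfile"
```

In bash, prefixing a call with the `time` keyword (for example `time run_n 10 grep '^450' infile`) reports the total for all ten runs, which can then be divided by ten.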