
I ran the following command:

time for i in {1..100}; do find / -name "*.service" | wc -l; done

and got 100 lines of output, followed by:

real    0m35.466s
user    0m15.688s
sys     0m14.552s

I then ran the following command:

time for i in {1..100}; do find / -name "*.service" | awk 'END{print NR}'; done

and again got 100 lines of output, followed by:

real    0m35.036s
user    0m15.848s
sys     0m14.056s

Note that I had already run find / -name "*.service" just beforehand, so it was cached for both commands.

I expected wc -l to be faster. Why is it not?

shrimpdrake
  • The initial `find` was caching direntries (inodes), not contents. – Charles Duffy Apr 12 '17 at 20:36
  • The result of `find` isn't going to be cached. – Oliver Charlesworth Apr 12 '17 at 20:36
  • @OliverCharlesworth, ...well, it is *slightly* -- the dentry cache will at least mean the filesystem doesn't need to go back to underlying media. It's still a new `find` instance each time talking to the filesystem each time. – Charles Duffy Apr 12 '17 at 20:37
  • @CharlesDuffy - Sure, agree that the relevant part of the filesystem will be cached :) – Oliver Charlesworth Apr 12 '17 at 20:39
  • 1
    @shrimpdrake, ...that said, even with stronger supporting benchmarks, I'm not sure that this is defensibly topical here. `wc` is by no means "unique to software development", as described in http://stackoverflow.com/help/on-topic -- and since there aren't documented performance guarantees for either tool, and *also* doesn't exist a single canonical implementation, this isn't really a question for which a canonical answer exists. – Charles Duffy Apr 12 '17 at 20:39
  • @shrimpdrake, ...definitely, though, `find` is the biggest wildcard -- copy its output to a file (ideally on tmpfs) before trying something like this again. I'd also suggest looking through `strace` output for each -- if they're reading with different block sizes, that's liable to be the whole of your delta (should it still exist at all). Context switches between the kernel and userland are expensive. – Charles Duffy Apr 12 '17 at 20:43
  • 1
    (It might also be interesting to know if `LC_CTYPE=C`, turning off multi-byte character support, has any effect). – Charles Duffy Apr 12 '17 at 20:44
  • 2
    Since you're timing the whole loop, the bulk of the time is probably in the `find` and the difference between the `wc` and `awk` is more or less lost in the noise. – Jonathan Leffler Apr 12 '17 at 20:46
  • You are launching `find` 100 times. Unless there are a lot of files found, you are basically benchmarking the filesystem, IO and process launch overhead much more than you are benchmarking `wc` or `awk`. Even if there are a lot of files found, `find` has more work to do than `wc` or `awk` anyway. `awk` and `wc` probably spend more time being started than counting lines. – Fred Apr 12 '17 at 21:16
  • How many lines is `find` producing for you to count? @Fred is likely correct that `awk` and `wc` have minimal work to do. Try counting the lines from a large file, instead. – ewindes Apr 12 '17 at 21:31
  • Thanks for those answers; find was indeed the wildcard here. I put a big txt file of 14M lines in /media/virtuelram/ and executed both "time for i in {1..100}; do awk 'END{ print NR}' /media/virtuelram/rockyou.txt; done" and "time for i in {1..100}; do wc -l /media/virtuelram/rockyou.txt; done", and the results were quite the opposite: 13.027s for wc and 80.207s for awk! – shrimpdrake Apr 12 '17 at 22:30
  • It's not `find` that caused it to deviate. Try running any process 100 times, timing each one in a loop, and you will be surprised how much the output differs. Say `for i in {1..100}; do time ls -l > /dev/null 2>&1; done`. CPU scheduling is what's causing it. – alvits Apr 13 '17 at 01:22

2 Answers


Others have mentioned that you're probably timing find, not wc or awk. Still, there may be interesting differences to explore between wc and awk in their various flavors.

Here are the results I get:

Mac OS 10.10.5 awk    0.16m lines/second
GNU awk/gawk 4.1.4    4.4m  lines/second
Mac OS 10.10.5 wc     6.8m  lines/second
GNU wc 8.27          11m    lines/second

I didn't use find, but instead ran `wc -l` or `awk 'END{print NR}'` on a large text file (66k lines) in a loop.

I varied the order of the commands and didn't find any deviations large enough to change the rankings I reported.

LC_CTYPE=C had no measurable effect on any of these.
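
For reference, a minimal sketch of the kind of loop used here, assuming a placeholder file big.txt (the one used above had about 66k lines; any large text file will do):

    # time only the counters, with no find in the pipeline
    time for i in {1..100}; do wc -l big.txt > /dev/null; done
    time for i in {1..100}; do awk 'END{print NR}' big.txt > /dev/null; done

The lines/second figures above are presumably total lines processed divided by elapsed wall-clock time.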

Conclusions

  1. Don't use the macOS built-in command-line tools except for trivial amounts of data.

  2. GNU wc is faster than GNU awk at counting lines.

I use the MacPorts GNU binaries. It would be interesting to see how the Homebrew binaries compare. (I'm guessing they'd lose.)

webb

Three things:

  1. Such a small difference is usually not significant:

    0m35.466s - 0m35.036s = 0m0.43s  or 1.2%
    
  2. Yet wc -l is faster (roughly 10x) than awk 'END{print NR}':

    % time seq 100000000  | awk 'END{print NR}' > /dev/null
    
    real    0m13.624s
    user    0m14.656s
    sys 0m1.047s
    % time seq 100000000  | wc -l > /dev/null
    
    real    0m1.604s
    user    0m2.413s
    sys 0m0.623s
    
  3. My guess is that the hard drive cache holds the find results, so after the first run with wc -l, most of the reads needed for find are in the cache. Presumably the difference in time between the initial find with disk reads and the second find with cache reads would be greater than the difference in run times between awk and wc.

    One way to test this is to reboot, which clears the hard disk cache, then run the two tests again, but in the reverse order, so that awk is run first. I'd expect that the first-run awk would be even slower than the first-run wc, and the second-run wc would be faster than the second-run awk.
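
    A sketch of that reversed-order re-test; instead of a full reboot, on Linux the cached data can be dropped directly (this needs root and is only a rough equivalent of rebooting):

    # drop page cache, dentries and inodes instead of rebooting (Linux, needs root)
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    # cold-cache run, awk first this time, then wc
    time for i in {1..100}; do find / -name "*.service" | awk 'END{print NR}'; done
    time for i in {1..100}; do find / -name "*.service" | wc -l; done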

agc
  • Almost as slow as `awk` is `sed`. `time seq 100000000 | sed -n '$=' > /dev/null` comes in at *12s*. – agc Apr 13 '17 at 14:27