
Say I have a file named jobs.csv and I would like to run Foo on the top 50,000 unique jobs.

I can either do:

# cat jobs.csv | sort -u | head -n 50000 > /tmp/jobs.csv
# cat /tmp/jobs.csv | while read -r line; do Foo --job="$line"; done

Or

# cat jobs.csv | sort -u | head -n 50000 | while read -r line; do Foo --job="$line"; done

Can one tell which one is better in terms of the system's I/O and memory efficiency?

Or even better, can one come up with a better solution for this?

masegaloeh
Tzury Bar Yochay

1 Answer


I normally go for the second option (pipes all the way) unless one of the intermediate outputs is useful to me for another task. For example, if after running Foo against 50k jobs, you then wanted to run Bar against the same jobs, it would be useful to have /tmp/jobs.csv available.

Using pipes all the way gives the system the ability to forget about data at the earliest possible time, so it is a more efficient use of memory. It also bypasses the VFS and tmpfs stacks, so it uses marginally less CPU. The chain as a whole also finishes sooner, because you don't need to wait for one step to finish before the next one starts (unless a particular program needs all of its input before producing output, as sort does).

By the way, in your example the biggest consumer of memory would be the sort stage, since it must hold its working set in memory to sort it (GNU sort falls back to temporary files once its buffer fills, but it is still the hungriest stage in the pipeline). You can make the whole thing more efficient by improving whatever creates jobs.csv in the first place, so that you no longer need the sort -u.
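A minimal, self-contained sketch of the pipes-all-the-way option with the robustness fixes that come up in the comments (pipefail, read -r, quoted expansion). The Foo function and the tiny generated jobs.csv are stand-ins for illustration only; the real Foo binary and input come from elsewhere, and the real head count would be 50000:

```shell
#!/usr/bin/env bash
# Fail the pipeline if any stage fails, not just the last one.
set -o pipefail

# Hypothetical stand-in for the real Foo binary: print the job it was given.
Foo() { printf 'ran %s\n' "${1#--job=}"; }

# Tiny sample input standing in for the real jobs.csv.
printf 'job-b\njob-a\njob-b\njob-c\n' > jobs.csv

# sort can read the file directly (no cat needed); head caps the count.
sort -u jobs.csv | head -n 2 |
while IFS= read -r line; do
    Foo --job="$line"
done
```

`read -r` stops backslashes in job names from being mangled, and quoting `"$line"` keeps jobs with spaces intact as a single argument.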

Tom Shaw
  • "You can make it more efficient by improving whatever creates jobs.csv in the first place so that you no longer need the sort -u" I wish I had control, at any extent, on external data sources ;-) – Tzury Bar Yochay Jun 05 '11 at 10:51
  • 1
    If you are indeed using pipes all the way and it's a bash script be sure to set `pipefail`: To quote the manpage "If pipefail is enabled, the pipeline's return status is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands exit successfully" Without that you might be scratching your head why your script exits with 0 but still produces bogus – serverhorror Jun 05 '11 at 12:13
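The pipefail behaviour the last comment warns about can be demonstrated in a couple of lines; by default a failure early in a pipeline is masked by a successful last stage:

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's status is that of its last command,
# so the failure of `false` is silently swallowed by `true`.
false | true
echo "without pipefail: $?"   # prints 0

# With pipefail, the pipeline reports the rightmost non-zero status.
set -o pipefail
false | true
echo "with pipefail: $?"      # prints 1
```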