Debian's Bash manual suggests using the special command substitution $(< file) wherever $(cat file) would otherwise be used, since it avoids executing an external binary and should therefore be faster.
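For reference, a minimal sketch of the two forms side by side (the path is just a stand-in):

# Hypothetical file path, for illustration only.
file=/etc/hostname

a=$(cat "$file")   # forks a subshell, then exec()s the external cat binary
b=$(< "$file")     # bash reads the file itself; no external process at all

[ "$a" = "$b" ] && echo "same contents"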

However, the measured completion times for the following two loops come out about the same:

time for i in {0..1000}; do echo str | { in=$(cat); }; done
time for i in {0..1000}; do echo str | { in=$(< /dev/fd/0); }; done

Over a few runs, they consistently return values around these figures, respectively:

real    0m3.665s
user    0m0.365s
sys     0m0.782s

and

real    0m2.401s
user    0m0.233s
sys     0m0.533s

So the improvement of the special command substitution over cat seems largely negligible for most use cases. Since my script needs to read large amounts of data from stdin quickly and repeatedly, what can I do to speed up these reads? In particular, the whole stdin stream needs to be dumped into a Bash variable for further parameter substitutions.
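As a side note, a subshell-free way to slurp all of stdin into a variable is Bash's read builtin with an empty delimiter (a sketch, assuming the stream contains no NUL bytes):

# Read everything from stdin into $in without a command substitution,
# so no subshell is forked.  read returns nonzero at EOF, hence || true.
# Unlike $(cat), trailing newlines are preserved; stops early at a NUL.
IFS= read -r -d '' in || true

# e.g.:  printf 'a\nb\n' | { IFS= read -r -d '' in || true; printf '%s' "$in"; }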

Further testing:

After the comments below and further testing, I set 10,000 iterations instead of 1,000 to minimize the pipe setup overhead, and I deleted the braces of the compound-command syntax:

$ time for i in {1..10000}; do echo str | in=$(cat); done

real    0m24.754s
user    0m6.958s
sys     0m18.996s

$ time for i in {1..10000}; do echo str | in=$(< /dev/fd/0); done

real    0m33.913s
user    0m3.736s
sys     0m10.516s

Here I am unable to explain why $(< /dev/fd/0) is even slower now.

davide
    I'm confused. Your benchmark shows a 33% speedup. – John Kugelman Apr 21 '15 at 17:25
  • ...and that quick benchmark is largely measuring the time needed to set up a pipeline, as opposed to the performance of reading stdin. Maybe if you showed us your *actual* code, we could try to help with optimizing it, but as it is, the numbers you're seeing look eminently reasonable. – Charles Duffy Apr 21 '15 at 17:26
  • For a built-in Bash call I was expecting something well under a millisecond. – davide Apr 21 '15 at 17:28
  • @davide, but you're doing much more than just calling a builtin in this code. You're setting up a pipeline! Pipelines are expensive! – Charles Duffy Apr 21 '15 at 17:30
  • @davide, that is to say, you should also measure the performance of `time for i in {0..1000}; do echo str | true; done`, to measure the other components of what this code is doing, and consider that a baseline. – Charles Duffy Apr 21 '15 at 17:31
  • @JohnKugelman, ...so it's actually closer to a 3x speedup when factoring out constant-time costs. – Charles Duffy Apr 21 '15 at 17:36
  • I get your point. I measured the same code over 10,000 cycles instead of the original 1,000. The figures I get are approx 34 seconds for `cat` and 30 for `$(<)`, so the pipe setup overhead seems to be minor compared with the other costs. – davide Apr 21 '15 at 17:45
  • re: "the other costs" -- if you haven't done that same test with `true` (no reading stdin at all), then you don't know how much goes into them. – Charles Duffy Apr 21 '15 at 17:53
  • If performance is an issue, why are you using `bash` at all? – chepner Apr 21 '15 at 17:54
  • Indeed. Even if one wanted to use a shell language, ksh93 (the real David Korn ksh, not the clones) is far, far faster. And as much as I despise it for its intentional noncompliance with POSIX, so is zsh. – Charles Duffy Apr 21 '15 at 17:54
  • How do you think increasing the number of iterations decreases that overhead? You're still setting up a pipeline once per iteration, since the pipeline is _inside_ your loop. Deleting the braces is irrelevant -- it's the thousands of subshells for the thousands of pipelines that are expensive. – Charles Duffy Apr 21 '15 at 18:02
  • Holy cow, that's true! I didn't realize the pipe was being created on each iteration. Moving the pipe out of the loop, after `done`, dramatically improves performance. So yes, the pipe is an expensive gadget. – davide Apr 21 '15 at 18:06

1 Answer

You're forgetting to factor out the performance costs unrelated to reading from stdin (the cost of fork()ing to create a subshell, setting up a pipeline, wait()ing for those processes to exit, and so on).

$ time for i in {0..1000}; do echo str | { in=$(cat); }; done
real    0m3.183s
user    0m1.427s
sys     0m2.486s

$ time for i in {0..1000}; do echo str | { in=$(< /dev/fd/0); }; done
real    0m1.973s
user    0m0.917s
sys     0m1.844s

$ time for i in {0..1000}; do echo str | true; done
real    0m1.294s
user    0m0.708s
sys     0m1.367s

Thus:

  • Using $(cat) adds approximately (3.183s - 1.294s == 1.889s) wall-clock time over 1000 iterations, compared to the code that does all other setup but doesn't read stdin.
  • Using $(</dev/fd/0) adds approximately (1.973s - 1.294s == 0.679s) over 1000 iterations.

That is roughly a 2.8x improvement, and well under the 1 ms per invocation you were expecting.
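To put that in per-call terms, a quick back-of-the-envelope using the wall-clock figures above ({0..1000} is 1001 iterations):

awk 'BEGIN {
    iters = 1001                                  # {0..1000}
    printf "cat:   %.3f ms/call\n", (3.183 - 1.294) / iters * 1000
    printf "$(<):  %.3f ms/call\n", (1.973 - 1.294) / iters * 1000
}'
# prints roughly:  cat: 1.887 ms/call   $(<): 0.678 ms/call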

Charles Duffy