
Is there any write-to-file buffer in bash programming? And if there is, is it possible to change its size?

Here is the problem.

I have a bash script which reads a file line by line, manipulates the data, and then writes the result to another file, something like this:

while read line
  some grep, cut and sed
  echo and append to another file

The input data is really huge (nearly a 20GB text file). Progress is slow, so the question arises: if the default behavior of bash is to write the result to the output file for each line read, that would explain why.

So I want to know: is there any mechanism to buffer some output and then write that chunk to the file? I searched the internet about this but didn't find any useful information.

Is this an OS question or a bash one? The OS is CentOS release 6.

The script is:

#!/bin/bash
BENCH=$1
# keep only the "CPU  0" lines, then pull out the pairs of hex values
grep "CPU  0" $BENCH > `pwd`/$BENCH.cpu0
grep -oP '(?<=<[vp]:0x)[0-9a-z]+' `pwd`/$BENCH.cpu0 | sed 'N;s/\n/ /' |  tr '[:lower:]' '[:upper:]' > `pwd`/$BENCH.cpu0.data.VP
echo "grep done"
# convert each pair of hex numbers to decimal, one line at a time
while read line ; do
   w1=`echo $line | cut -d ' ' -f1`
   w11=`echo "ibase=16; $w1" | bc`
   w2=`echo $line | cut -d ' ' -f2`
   w22=`echo "ibase=16; $w2" | bc`
   echo $w11 $w22 >> `pwd`/$BENCH.cpu0.data.VP.decimal
done <"`pwd`/$BENCH.cpu0.data.VP"
echo "conversion done"
mahmood
  • The kernel buffers writes, and your RAID card may also buffer writes. With an `echo` and append, you are opening and closing the file for each write. – jordanm May 29 '13 at 15:42
  • I wonder if it would be faster to just grep the file, pipe that through sed and whatnot, and redirect that to a file. – Markus Mikkolainen May 29 '13 at 15:43
  • @Markus Mikkolainen: They are already piped. – mahmood May 29 '13 at 15:45
  • `while read line` is dreadfully inefficient and forking greps and seds within the loop is even more costly. – msw May 29 '13 at 16:19

2 Answers


Each echo-and-append in your loop opens and closes the output file, which may have a negative impact on performance.
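
A minimal fix along those lines, keeping your loop but opening the output file only once, is to redirect the whole loop instead of appending inside it (a sketch, using the file names from your script):

while read line ; do
    w1=`echo $line | cut -d ' ' -f1`
    w11=`echo "ibase=16; $w1" | bc`
    w2=`echo $line | cut -d ' ' -f2`
    w22=`echo "ibase=16; $w2" | bc`
    echo $w11 $w22      # goes to the single redirection below
done <"$BENCH.cpu0.data.VP" >"$BENCH.cpu0.data.VP.decimal"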

A likely better approach (and you should profile) is simply:

grep 'foo' <$input_file | sed 's/bar/baz/' | [any other stream operations] >$output_file
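
Applied to the script in the question, the whole hex-to-decimal loop collapses into one pipeline. Here is a sketch assuming GNU awk, whose strtonum() understands a 0x prefix:

# one pass over the data, no per-line forks; file names as in the question
grep "CPU  0" "$BENCH" \
  | grep -oP '(?<=<[vp]:0x)[0-9a-z]+' \
  | sed 'N;s/\n/ /' \
  | awk '{ printf "%d %d\n", strtonum("0x" $1), strtonum("0x" $2) }' \
  > "$BENCH.cpu0.data.VP.decimal"

The tr '[:lower:]' '[:upper:]' step from the original drops out, since strtonum() accepts lowercase hex digits (bc is the one that insists on uppercase).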

If you must keep the existing structure, then an alternative approach would be to create a named pipe:

mkfifo buffer

Then create 2 processes: one which writes into the pipe, and one which reads from the pipe.

#proc1: filter each line and write the results into the pipe
while read line; do
    echo "$line" | grep foo | sed 's/bar/baz/'
done <"$input_file" >buffer


#proc2: drain the pipe into the output file, opened only once
while read line; do
    echo "$line"
done <buffer >>"$output_file"

In reality I would expect the bottleneck to be entirely file I/O, but this does create an independence between the reading and the writing, which may be desirable.

If you have 20GB of RAM lying around, it may improve performance to use a memory-mapped temporary file instead of a named pipe.
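
The shell has no direct mmap interface, but on Linux a file under /dev/shm (tmpfs) lives in RAM, which gets you much of the same effect. A sketch under that assumption:

# RAM-backed scratch file via tmpfs; needs as much free RAM as the data
tmp=/dev/shm/scratch.$$
grep 'foo' "$input_file" >"$tmp"
sed 's/bar/baz/' "$tmp" >"$output_file"
rm -f "$tmp"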

cmh
  • Can you give an example? Is this a linux command? Do you mean `mkfifo buffer | ./run_prog`? – mahmood May 29 '13 at 15:45
  • +1. Is there a way of creating a memory-mapped file using shell commands? – iruvar May 29 '13 at 16:07
  • Is it possible to use `echo $something > $buffer`? I get this error `$buffer: ambiguous redirect`. Can you test your script with `something=\`grep foo | sed 's/bar/baz'\`;echo $something > $buffer` – mahmood May 29 '13 at 16:13
  • @mahmood: use `echo $something >buffer` if the pipe is literally named `buffer`. Use `>$buffer` if `$buffer` is a variable storing the name of your pipe. I will clear up the question. – cmh May 29 '13 at 16:16
  • Although this addresses the question, it's kind of a "sledgehammer to swat flies" approach. *added* oh, I see you added the pipeline suggestion, which is really how it ought to be done. – msw May 29 '13 at 16:18
  • @1_CR, I'm not aware of anything unfortunately. I wouldn't be surprised if there was a wrapper somewhere. – cmh May 29 '13 at 16:18
  • @mahmood, both of my examples are not complete runnable scripts, merely snippets to help you understand what you need to do to your own code. As msw has demonstrated with his benchmark, you should almost certainly pursue the first approach (stream processing). – cmh May 29 '13 at 17:03

Just to see what the differences were, I created a file of 10,000 lines (about 50MiB), each of the form

a somewhat long string followed by a number: 0000001

and ran it through a shell read loop:

while read line ; do
  echo $line | grep '00$' | cut -d " " -f9 | sed 's/^00*//'
done < data > data.out

That took almost 6 minutes. Compare it with the equivalent

grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast

which took 0.2 seconds. To remove the cost of the forking, I tested

while read line ; do
  :
done < data > data.null

where `:` is a shell built-in which does nothing at all. As expected, `data.null` had no contents, and the loop still took 21 seconds to run through my small file. I wanted to test against a 20GB input file, but I'm not that patient.

Conclusion: learn how to use awk or perl, because you will wait forever if you run the script you posted (which appeared while I was writing this).
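
For reference, the same transform as the fast pipeline above in a single awk process might look like this (a sketch; note that awk's default field splitting differs slightly from `cut -d " "` on runs of spaces):

awk '/00$/ { sub(/^00*/, "", $9); print $9 }' data > data.awk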

msw
  • You could do your benchmark all in `sed` although it's slightly ugly; `sed '/00/s/^[^ ]* [^ ]* [^ ]* ^[^ ]* [^ ]* [^ ]* ^[^ ]* [^ ]* \(00*\)\?//' data > data.fast` – tripleee May 29 '13 at 16:59
  • As I based my example on the pseudo-code that the questioner first posted, I had to make things up. I expect you know the general principle holds especially since his real script is even more forkful. – msw May 29 '13 at 17:05
  • Sorry, I'm late to the party. If you strace the script with `while read line; do ... done` you'll see that bash implements its read builtin as numerous `read` system calls with a single-byte buffer. On the other hand `grep` buffers its input (on my system its input buffer is 4KB). Thus, to read 4KB of input, bash would switch context user-kernel-user 4096 times, whereas `grep` will change context only once. Moreover, the processes inside the loop live only for one iteration, and each time bash reads a new line the full set of processes (+1 for echo) is created. This is a very large overhead. – Sergey Kanaev Apr 14 '15 at 08:13
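
A quick way to observe the single-byte reads described in the last comment (a sketch; exact strace output varies by version):

strace -e trace=read bash -c 'read line' < data 2>&1 | head
# expect a long run of read(0, "...", 1) = 1 calls, one byte per syscall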