How does data get processed across pipes?

Question

I used this command-line program that I found in another post on SO describing how to spider a website.

wget --spider --force-html -r -l2 http://example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out

When I crawl a large site, it takes a long time to finish. Meanwhile the wget.out file on disk shows zero size. So when does the piped data get processed and written to the file on disk? Is it only after each stage in the pipe has run to completion? In that case, will wget.out fill up only once the entire crawl is over?

How do I make the program write to disk intermittently, so that even if the crawling stage is interrupted, I have some output saved?

possible duplicate of [Turn off buffering in pipe](http://stackoverflow.com/questions/1000674/turn-off-buffering-in-pipe) — eumiro, Jan 24 '11 at 11:10

bobbogo · Accepted Answer · 2011-01-25T11:45:24.917

There is buffering in each pipe, and usually also in the stdio layer of each program: when stdout is a pipe or a file rather than a terminal, stdio typically switches from line buffering to block buffering. Data will not reach the disk until the final grep has processed enough lines to fill its output buffer and flush it to the file.
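One way to make output appear sooner is to ask each stage to flush after every line. The following is a minimal sketch, not a tested replacement for the original command: it assumes GNU grep (for `--line-buffered`) and an awk that supports `fflush()`, and it uses a hypothetical `emit_lines` function as a stand-in for the slow wget stage so that it runs without network access.

```shell
# Hypothetical stand-in for the wget stage: prints a few wget-style log lines.
emit_lines() {
    printf -- '--2011-01-24 11:00:00--  http://example.com/index.html\n'
    printf 'Length: unspecified [text/html]\n'
    printf -- '--2011-01-24 11:00:01--  http://example.com/style.css\n'
    printf -- '--2011-01-24 11:00:02--  http://example.com/about.html\n'
}

# Each filter flushes after every line instead of waiting for a full buffer:
#   grep --line-buffered  flushes grep's output per line (GNU grep option)
#   fflush()              flushes awk's output after each print
emit_lines |
    grep --line-buffered '^--' |
    awk '{ print $3; fflush() }' |
    grep --line-buffered -v '\.\(css\|js\|png\|gif\|jpg\)$' > wget.out

cat wget.out
```

With the real wget in place of `emit_lines`, each URL should reach wget.out shortly after wget logs it, at the cost of more write calls.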

If you run your pipeline on the command line and then hit Ctrl-C, SIGINT will be sent to every process in the pipeline, terminating each one and losing any pending output.

Either:

Ignore SIGINT in all processes but the first. Bash hackery follows:

$ wget --spider --force-html -r -l2 http://example.com 2>&1 |
    { trap '' int; grep '^--'; } |
    { trap '' int; awk '{ print $3 }'; } |
    ∶

Simply deliver the interrupt to the first process only. Run the pipeline in the background; interactively you can then discover the first process's pid with jobs -l and kill that.
```
$ jobs -l
[1]+ 10864 Running          wget
   3364 Running             | grep
  13500 Running             | awk
∶
$ kill -int 10864
```
Play around with the disown bash builtin.
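The `trap '' int` trick in the first option relies on an ignored signal staying ignored in any command started afterwards. A small sketch of that behaviour, using a throwaway `sh -c` child rather than the real pipeline (the `result` variable is only there for illustration):

```shell
# The subshell ignores SIGINT; the child sh inherits that disposition,
# so the SIGINT it sends to itself has no effect and it carries on.
result=$( trap '' INT; sh -c 'kill -INT $$; echo "still running"' )
echo "$result"
```

In the pipeline, the downstream stages survive Ctrl-C in the same way. Once wget dies they see end-of-file on their input, flush what they have buffered, and exit normally.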
