I want to download webpages recursively and pipe the output to a filter. I am using:

wget -qm -O- http://mywebsite.com/initialpath.php | ./filter

But wget stops downloading after the first page and waits for input instead of parsing the webpage and downloading the linked files. It works if I save the output to a file with -O filename, but I want to handle the webpages on the fly with a filter.

How can I achieve this?


1 Answer


It does not seem possible to achieve my goal with current versions of wget.

After studying the source code for wget version 1.18, I came to these conclusions:

  • wget cannot recurse if it does not store the downloaded files, at least temporarily, as it does for --spider.

  • When passed -O filename, it keeps appending to filename and re-parses the whole file after each download, loading it completely into memory (or mapping it). This is very cumbersome and inefficient.

  • When passed -O-, it pipes the downloaded file to stdout and then attempts to re-open - to look for more URLs to fetch, which makes it read from stdin for that purpose. This is a side effect of the implementation and explains why wget appears to wait for input (see the quick test after this list).
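
Assuming the stdin re-read described above, a quick way to confirm the diagnosis (a diagnostic sketch, not a fix) is to redirect standard input from /dev/null: wget should no longer block, but since the re-parse then sees an empty file, recursion still stops after the first page.

wget -qm -O- http://mywebsite.com/initialpath.php < /dev/null | ./filter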

I wrote a patch to add a more sensible piping option, relying on the --spider machinery to download HTML and CSS files for recursive operation and piping these files to stdout before they are removed. I will publish the patch once it is reasonably tested and documented.
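
In the meantime, a possible workaround (only a sketch; it gives up the on-the-fly behaviour and uses an arbitrary temporary directory name) is to let wget mirror the site to disk with -P and feed the saved pages to the filter afterwards:

tmpdir=$(mktemp -d)                     # scratch directory for the mirror
wget -qm -P "$tmpdir" http://mywebsite.com/initialpath.php
find "$tmpdir" -type f \( -name '*.html' -o -name '*.php' \) -exec cat {} + | ./filter
rm -rf "$tmpdir"                        # discard the mirrored files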
