I want to download webpages recursively and pipe the output to a filter. I am using:

wget -qm -O- http://mywebsite.com/initialpath.php | ./filter

But wget stops downloading after the first page and waits for input instead of parsing the webpage and downloading the linked files. It works if I save the output to a file with -O filename, but I want to handle the webpages on the fly with a filter.

How can I achieve this?


1 Answer


It does not seem possible to achieve my goal with current versions of wget.

After studying the source code for wget version 1.18, I came to these conclusions:

  • wget cannot recurse if it does not store the downloaded files, at least temporarily, as it does for --spider.

  • When passed -O filename, it keeps appending to filename and re-parses the whole file after each download, loading it completely into memory (or mapping it). This is very cumbersome and inefficient.

  • When passed -O-, it pipes the downloaded file to stdout and then attempts to re-open - to look for more URLs to fetch, which makes it read from stdin for that purpose. This is a side effect of the implementation and explains why wget appears to wait for input (see the quick test after this list).
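
Assuming the stdin re-read described above, a quick way to confirm the diagnosis (a diagnostic sketch, not a fix) is to redirect standard input from /dev/null: wget should no longer block, but since the re-parse then sees an empty file, recursion still stops after the first page.

wget -qm -O- http://mywebsite.com/initialpath.php < /dev/null | ./filter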

I wrote a patch to add a more sensible piping option, relying on the --spider machinery to download HTML and CSS files for recursive operation and piping these files to stdout before they are removed. I will publish the patch once it is reasonably tested and documented.
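
In the meantime, a possible workaround (only a sketch; it gives up the on-the-fly behaviour and uses an arbitrary temporary directory name) is to let wget mirror the site to disk with -P and feed the saved pages to the filter afterwards:

tmpdir=$(mktemp -d)                     # scratch directory for the mirror
wget -qm -P "$tmpdir" http://mywebsite.com/initialpath.php
find "$tmpdir" -type f \( -name '*.html' -o -name '*.php' \) -exec cat {} + | ./filter
rm -rf "$tmpdir"                        # discard the mirrored files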
