
I cannot get wget to mirror a section of a website (a folder path below root) - it only seems to work from the website homepage.

I've tried many options - here is one example:

wget -rkp -l3 -np  http://somewebsite/subpath/down/here/

While I only want to mirror the content links below that URL, I also need to download all the page assets, which are not under that path.

It seems to work fine for the homepage (/) but I can't get it going for any subfolders.

– sub

5 Answers


Use the --mirror (-m) and --no-parent (-np) options, plus a few other useful ones, as in this example:

wget --mirror --page-requisites --adjust-extension --no-parent --convert-links \
     --directory-prefix=sousers http://stackoverflow.com/users
– kenorb, Attilio
  • To save anyone else searching the wget manual, -p is --page-requisites and -P is --directory-prefix – Alf Eaton Oct 03 '12 at 10:11
  • Just as a note for others who might bump into this issue: the most commonly downloaded wget binary for Windows 7 seems to be the gnuwin32 package from sourceforge.net, but that is wget 1.11, which does not have the --adjust-extension functionality. It was apparently added only in wget 1.12, so Windows 7 users can get a much newer, self-contained binary from here instead (http://eternallybored.org/misc/wget/) – bdutta74 Mar 05 '14 at 06:56
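
For reference, the same command from the answer above written with wget's short options (sousers is just the example directory prefix used there):

wget -m -p -E -np -k -P sousers http://stackoverflow.com/users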

I usually use:

wget -m -np -p $url
– ninjalj
  • `-p` to download everything required to display a page is useful. Does that override `-np` for those elements necessary to display a page? – Geremia Feb 11 '16 at 16:46
  • For informational purposes only: `-m` = mirror, `-np` = no parent (don't retrieve files higher up in the hierarchy when recursing), `-p` = page requisites or all items necessary to appropriately display the web page. – Shrout1 Mar 16 '17 at 20:07

I use pavuk to make mirrors, as it seemed much better suited to this purpose from the beginning. You can use something like this:

/usr/bin/pavuk -enable_js -fnrules F '*.php?*' '%o.php' -tr_str_str '?' '_questionmark_' \
               -norobots -dont_limit_inlines -dont_leave_dir \
               http://www.example.com/some_directory/ >OUT 2>ERR
– rubo77, Tomas

For my use case, the --no-parent option didn't quite work.

I was trying to mirror https://www.example.com/section and URLs under it, like https://www.example.com/section/subsection. This can't be done with --no-parent: if you start at /section (no trailing slash), the parent is the site root, so wget will download the entire site; if you start at /section/, the site redirects to /section, which is now above the allowed directory, so wget stops. Fun.

Instead, I am using --accept-regex 'https://www.example.com/(section|assets/).*'. This worked. (It would also download /sectionfoobar, but that was acceptable for me, and now we are wandering into regexp territory, which is amply covered elsewhere on SO.)
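
Put together, the full invocation looked something like this (a sketch using the example host and paths above; --accept-regex needs a reasonably recent wget):

wget --recursive --level=inf --page-requisites --convert-links \
     --accept-regex 'https://www.example.com/(section|assets/).*' \
     https://www.example.com/section/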

– chx

Check out archivebox.io; it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshots, static assets and more).

By default, it only archives the URL you specify, but we're adding a --depth=n flag soon that will let you recursively archive links from the given URL.
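
A minimal usage sketch, assuming a recent ArchiveBox install where the archivebox command is on your PATH (the URL is just a placeholder):

# create a new archive collection in the current directory
archivebox init

# archive the given URL (just that page for now, per the note above)
archivebox add 'https://www.example.com/section/'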

– Nick Sweeting