
I cannot get wget to mirror a section of a website (a folder path below root) - it only seems to work from the website homepage.

I've tried many options - here is one example:

wget -rkp -l3 -np  http://somewebsite/subpath/down/here/

While I only want to mirror the content links below that URL, I also need to download all the page assets, which are not under that path.

It seems to work fine for the homepage (/) but I can't get it going for any subfolders.

– sub

5 Answers


Use the --mirror (-m) and --no-parent (-np) options, plus a few other useful ones, as in this example:

wget --mirror --page-requisites --adjust-extension --no-parent --convert-links \
     --directory-prefix=sousers http://stackoverflow.com/users
– kenorb, Attilio
  • To save anyone else searching the wget manual, -p is --page-requisites and -P is --directory-prefix – Alf Eaton Oct 03 '12 at 10:11
  • Just as a note for others who might bump into this issue: the most commonly downloaded wget binary for Windows 7 seems to be the gnuwin32 package from sourceforge.net, but that is wget 1.11, which does not have the --adjust-extension functionality. It was apparently added only in wget 1.12, so Windows 7 users can get a much newer, self-contained binary from here instead (http://eternallybored.org/misc/wget/) – bdutta74 Mar 05 '14 at 06:56
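
For reference, the same command from the answer above written with wget's short options (sousers is just the example directory prefix used there):

wget -m -p -E -np -k -P sousers http://stackoverflow.com/users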

I usually use:

wget -m -np -p $url
– ninjalj
  • `-p` to download everything required to display a page is useful. Does that override `-np` for those elements necessary to display a page? – Geremia Feb 11 '16 at 16:46
  • For informational purposes only: `-m` = mirror, `-np` = no parent (don't retrieve files higher up in the hierarchy when recursing), `-p` = page requisites or all items necessary to appropriately display the web page. – Shrout1 Mar 16 '17 at 20:07

I use pavuk to make mirrors, as it seemed much better suited to this purpose from the beginning. You can use something like this:

/usr/bin/pavuk -enable_js -fnrules F '*.php?*' '%o.php' -tr_str_str '?' '_questionmark_' \
               -norobots -dont_limit_inlines -dont_leave_dir \
               http://www.example.com/some_directory/ >OUT 2>ERR
– rubo77, Tomas

For my use case, the --no-parent option didn't quite work.

I was trying to mirror https://www.example.com/section and URLs under it, like https://www.example.com/section/subsection. This can't be done with --no-parent: if you start at /section (no trailing slash), the parent is the site root, so wget will download the entire site; if you start at /section/, the site redirects to /section, which is now above the allowed directory, so wget stops. Fun.

Instead, I am using --accept-regex 'https://www.example.com/(section|assets/).*'. This worked. (It would also download /sectionfoobar, but that was acceptable for me, and now we are wandering into regexp territory, which is amply covered elsewhere on SO.)
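
Put together, the full invocation looked something like this (a sketch using the example host and paths above; --accept-regex needs a reasonably recent wget):

wget --recursive --level=inf --page-requisites --convert-links \
     --accept-regex 'https://www.example.com/(section|assets/).*' \
     https://www.example.com/section/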

– chx

Check out archivebox.io; it's an open-source, self-hosted tool that creates a local, static, browsable HTML clone of websites (it saves HTML, JS, media files, PDFs, screenshots, static assets and more).

By default, it only archives the URL you specify, but we're adding a --depth=n flag soon that will let you recursively archive links from the given URL.
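
A minimal usage sketch, assuming a recent ArchiveBox install where the archivebox command is on your PATH (the URL is just a placeholder):

# create a new archive collection in the current directory
archivebox init

# archive the given URL (just that page for now, per the note above)
archivebox add 'https://www.example.com/section/'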

– Nick Sweeting