I just used wget
to archive a phpBB2 forum completely. Things might be a bit different for phpBB3 or newer versions, but the basic approach is probably still useful.
I first populated a file with session cookies (to prevent phpBB from putting sid= in links), then did the actual mirror. I used wget 1.20 for this, since 1.18 messed up --adjust-extension handling for non-HTML files (e.g. GIFs).
wget https://example.com/forum/ --save-cookies cookies \
--keep-session-cookies
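As a quick sanity check, you can verify that the cookie file actually contains a session cookie before starting the mirror. This assumes the forum's session cookie name contains sid, which is typical for phpBB2 but not guaranteed:
grep -i sid cookies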
wget https://example.com/forum/ --load-cookies cookies \
--page-requisites --convert-links --mirror --no-parent --reject-regex \
'([&?]highlight=|[&?]order=|posting.php[?]|privmsg.php[?]|search.php[?]|[&?]mark=|[&?]view=|viewtopic.php[?]p=)' \
--rejected-log=rejected.log -o wget.log --server-response \
--adjust-extension --restrict-file-names=windows
This tells wget to recursively mirror the entire site, including page requisites (CSS and images). It rejects (skips) certain URLs, mostly because they are no longer useful in a static site (e.g. search) or are just slightly different or even identical views on the same content (e.g. viewtopic.php?p=... just returns the topic containing the given post, so there is no need to mirror that topic again for each individual post). The --adjust-extension option makes wget add .html to dynamically generated HTML pages, and --restrict-file-names=windows makes it replace (among other things) the ? with a @, so you can actually put the result on a webserver without that webserver chopping the URLs at the ? (which normally starts the query parameters).
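To make the effect of those last two options concrete: with both enabled, a dynamically generated page like viewtopic.php?t=123 (an illustrative URL, the exact parameters depend on the forum) ends up on disk roughly as
example.com/forum/viewtopic.php@t=123.html
and --convert-links rewrites all internal links to match, so the mirror can be served by any static webserver. If you want to check the result locally first, running something like python3 -m http.server from the mirror directory should work, assuming Python 3 is available.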