
I would like to obtain relatively up-to-date static HTML files from the enormous (even when compressed) English Wikipedia XML dump file enwiki-latest-pages-articles.xml.bz2, which I downloaded from the Wikimedia dumps page. There seem to be quite a few tools available, although the documentation on them is pretty scant, so I don't know what most of them do or whether they're up to date with the latest dumps. (I'm rather good at building web crawlers that can crawl through relatively small HTML pages/files, but I'm awful with SQL and XML, and I don't expect to be very good with either for at least another year.) I want to be able to crawl through HTML files obtained from a dump offline, without resorting to crawling Wikipedia online.
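
For concreteness, the kind of offline crawl I have in mind is roughly the following (just a sketch; the directory layout and file naming are assumptions about how a static HTML dump might land on disk):

    # Minimal sketch of the offline crawl I have in mind; the directory
    # layout ("static-html/", one .html file per page) is an assumption
    # about how a static dump would be stored locally.
    import os
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect href values from <a> tags in one HTML file."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(root="static-html"):
        """Walk a tree of dumped HTML files and yield (path, links) pairs."""
        for dirpath, _, filenames in os.walk(root):
            for fn in filenames:
                if not fn.endswith(".html"):
                    continue
                path = os.path.join(dirpath, fn)
                parser = LinkCollector()
                with open(path, encoding="utf-8", errors="replace") as f:
                    parser.feed(f.read())
                yield path, parser.links

    if __name__ == "__main__":
        for path, links in crawl():
            print(path, len(links), "links")

Anything along those lines would do; the point is that everything happens against local files.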

Does anyone know of a good tool to obtain static HTML files from recent Wikipedia XML dumps?

Brian Schmitz

1 Answer


First, import the dump into a local MediaWiki installation. Then generate the HTML files with the DumpHTML extension. Although simple in theory, this process could be complicated in practice due to the sheer volume of data and the fact that DumpHTML is a bit neglected, so don't hesitate to ask for help.
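
Roughly, the two steps can be scripted like this (a minimal sketch only; the install path, the output directory, the rebuild step, and the -d flag are assumptions about a typical setup, so check each script's --help before running anything):

    # Sketch of "import the dump, then run DumpHTML" via the MediaWiki
    # maintenance scripts. Paths and flags below are assumptions about a
    # typical install; verify them against your own setup.
    import subprocess

    MEDIAWIKI = "/var/www/mediawiki"                 # assumed install path
    DUMP = "enwiki-latest-pages-articles.xml.bz2"    # the downloaded dump
    OUT = "/data/enwiki-html"                        # assumed output directory

    # 1. Import the XML dump into the local wiki's database.
    #    (Some setups need the dump decompressed or piped in instead.)
    subprocess.run(
        ["php", f"{MEDIAWIKI}/maintenance/importDump.php", DUMP],
        check=True,
    )

    # 2. Rebuild derived tables so the imported pages show up properly.
    subprocess.run(
        ["php", f"{MEDIAWIKI}/maintenance/rebuildrecentchanges.php"],
        check=True,
    )

    # 3. Generate static HTML with the DumpHTML extension
    #    (the -d option naming the destination directory is an assumption).
    subprocess.run(
        ["php", f"{MEDIAWIKI}/extensions/DumpHTML/dumpHTML.php", "-d", OUT],
        check=True,
    )

With a full English Wikipedia dump, expect each step to run for a very long time.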

MaxSem
  • Also, it could take weeks or months. I used to import Wiktionary dumps several years ago, which were several orders of magnitude smaller, and it took several days. Doing it on a very beefy machine will help. I wonder if anybody can tell us how long it took them to import. – hippietrail May 23 '12 at 09:37
  • Processing time will definitely be a consideration. I may be able to get a beefy desktop machine at some point, although I don't know if that would be enough to deal with the scale we're talking about here. (I wonder if there is a parallel solution.) I know there are static HTML dumps available, although the most recent is from 2008, which is far less than ideal. – Brian Schmitz May 23 '12 at 13:59
  • What about dynamically rendering just the parts needed for a given page, as part of a script bundled with an offline custom Ubuntu distro? @hippietrail – Luke Stanley Dec 11 '13 at 14:34
  • @LukeStanley: You can't correctly render a MediaWiki page without the same version of MediaWiki, the same set of extensions, the same version of each extension, the same configuration, and the same set of templates. If you can make use of an incorrect render, then you can get away with a lot less. – hippietrail Dec 11 '13 at 14:39
  • @hippietrail That may not be that hard if this Vagrant VM setup (http://www.mediawiki.org/wiki/Mediawiki-vagrant) is suitable! But a less correct renderer may be acceptable, depending on how incorrect it is :) – Luke Stanley Dec 19 '13 at 15:15
  • Sometimes "incorrect" means missing important information buried in, or built by, too many layers of clever templates. But yes, that's exactly right. – hippietrail Dec 19 '13 at 15:21