
I have a MediaWiki installation that serves as a dictionary of German terms and their translations into a local dialect. Each page holds one term, its translation and some additional information.

Now, for a printable version of the dictionary, I need a full export of all terms and their translations. Since this is an extract of each page's content, I guess I need a complete export of all pages in their latest version in a parsable format, e.g. XML or CSV.

Has anyone done that, or can you point me to a tool? I should mention that I don't have full access to the server (e.g. no command line), but I am able to add MediaWiki extensions and access the MySQL database.

svick
Alexander Rühl

7 Answers


You can export the page content directly from the database. You will get the raw wiki markup, just as with Special:Export, but it is easier to script the export, and you don't need to make sure all your pages are in some special category.

Here is an example:

SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;

If your wiki uses PostgreSQL, the table "text" is named "pagecontent", and you may need to specify the schema. In that case, the same query would be:

SET search_path TO mediawiki,public;

SELECT page_title, page_touched, old_text 
FROM revision,page,pagecontent
WHERE revision.rev_id=page.page_latest
AND pagecontent.old_id=revision.rev_text_id;
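
If you want to go one step further and produce a CSV for the printable dictionary, the query can be wrapped in a small script. Here is a minimal PHP sketch (not part of the original answer), assuming the pre-1.35 schema used above; the connection details and output file name are made-up placeholders to adjust for your installation:

<?php
// Minimal sketch: run the query above against the wiki database and write a CSV.
// Host, user, password, database name and output file are placeholders.
$db = new mysqli('localhost', 'wikiuser', 'secret', 'wikidb');

$result = $db->query(
    'SELECT page_title, page_touched, old_text
     FROM revision, page, text
     WHERE revision.rev_id = page.page_latest
       AND text.old_id = revision.rev_text_id'
);

$out = fopen('dictionary_dump.csv', 'w');
fputcsv($out, array('title', 'touched', 'wikitext'));
while ($row = $result->fetch_assoc()) {
    fputcsv($out, array($row['page_title'], $row['page_touched'], $row['old_text']));
}
fclose($out);
$db->close();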
mivk
  • Since I established the common category automatically and it works fine, I'll continue using this method, but thanks for the hint anyway; it may come in handy. – Alexander Rühl Nov 18 '13 at 08:59
  • @Geziefer how did you automatically set the category? – B T Feb 06 '14 at 07:57
  • 1
    @BT: Each page is using the same template and there I have the common category ("term") for all of them. Plus I have used mediawiki functions to add another category by parsing the first letter of the page's name and make it a category. – Alexander Rühl Feb 06 '14 at 08:06

This worked very well for me. Notice I redirected the output to the file backup.xml. From a Windows Command Processor (CMD.exe) prompt:

cd \PATH_TO_YOUR_WIKI_INSTALLATION\maintenance
\PATH_OF_PHP.EXE\php dumpBackup.php --full > backup.xml
Robert Stevens

Export

cd maintenance
php5 ./dumpBackup.php --current > /path/wiki_dump.xml

Import

cd maintenance
php5 ./importDump.php < /path/wiki_dump.xml
Sirex
Semen
  • You may have missed the point "I should mention that I don't have full access to the server (e.g. no command line)" in my question. And you should not complain about Stack Overflow in an answer, but post specific problems on Meta. – Alexander Rühl Jun 20 '13 at 13:40

I'm not completely satisfied with the solution, but I ended up putting all pages into a common category; I can then enter this category in the Special:Export box to add all of the contained page names. It seems to work, although I'm not sure whether it will still work once I reach a few thousand pages.
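
If the export ever needs to be scripted instead of going through the Special:Export form by hand, the list of titles in the common category can also be collected through the MediaWiki web API. A rough sketch, using the "Term" category mentioned in the comments above; the api.php URL is a placeholder:

<?php
// Rough sketch: collect all page titles in Category:Term via the MediaWiki API.
// Replace the API URL and category name with your own.
$api = 'https://example.org/w/api.php';
$titles = array();
$continue = '';
do {
    $url = $api . '?action=query&list=categorymembers&format=json'
         . '&cmtitle=' . urlencode('Category:Term') . '&cmlimit=500' . $continue;
    $data = json_decode(file_get_contents($url), true);
    foreach ($data['query']['categorymembers'] as $member) {
        $titles[] = $member['title'];
    }
    // Follow the continuation marker until all category members have been listed.
    $continue = isset($data['continue']['cmcontinue'])
        ? '&cmcontinue=' . urlencode($data['continue']['cmcontinue'])
        : '';
} while ($continue !== '');

// One title per line, ready to paste into the Special:Export text box.
file_put_contents('titles.txt', implode("\n", $titles));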

Alexander Rühl
  • An update to those using a similar method: We now have over 1400 terms in one export and it works quickly and without problems, so it seems to be a valid method. – Alexander Rühl Sep 06 '12 at 13:05

You can set $wgExportAllowAll (https://www.mediawiki.org/wiki/Manual:$wgExportAllowAll) to true, then export all pages from Special:Export.
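
For reference, this is a single line in LocalSettings.php, so no shell access is required, only the ability to edit that file (e.g. via FTP):

// In LocalSettings.php
$wgExportAllowAll = true; // allow exporting all pages at once from Special:Export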

plex
  • Thanks for the comment, but I guess this would really export all pages? In my situation I only need the pages holding a term, so the solution I mentioned further down works fine for me. – Alexander Rühl Mar 09 '21 at 06:56

It looks less than simple. http://meta.wikimedia.org/wiki/Help:Export might help, but probably not.

If the pages are all structured in the same way, you might be able to write a web scraper with something like Scrapy.

Rob Cowie
  • Well, the standard export function only exports discrete pages; what I need is a full export of all pages. I hadn't heard of Scrapy, thanks for the hint, but since all my pages use a common template, I guess it would be easier to extract the data from the list of key-value pairs in each article than to fiddle with the HTML. – Alexander Rühl Jul 19 '11 at 10:02
  • `dumpBackup.php` can output all pages to XML, I think, but it requires server access to execute, which you haven't got, so it's probably no use. If you have external access to the database, that is probably your best bet. – Rob Cowie Jul 19 '11 at 13:18
  • And what exactly would you do with the database access? The content is somewhat hidden there in a blob structure, if I got it right. But maybe it would help to make a complete dump of the online site's database and import it into a local database with a local MediaWiki? Anyway, I guess there must be some way to do the export, which works for a single page, for all pages at once... – Alexander Rühl Jul 19 '11 at 15:58
  • Page text does seem to be stored in a BLOB in the `text` table. I don't have any experience with it so I'm not sure what would be involved in making it usable in your case. For info, the database schema is at http://upload.wikimedia.org/wikipedia/commons/b/b7/MediaWiki_database_schema_1-17_%28r82044%29.png – Rob Cowie Jul 19 '11 at 16:53

You can use the special page Special:Export to export to XML; here is Wikipedia's version.

You might also consider Extension:Collection if you eventually want it in a human-readable (e.g. PDF) form.

  • As said below, I know the special page for exporting, but there you have to add **all** page names as parameters. I was hoping for a way to export every page there is without having to specify them all. A human-readable form is not what I want; I actually need the content of the page, meaning the values filled into the template, rather than the rendered content. – Alexander Rühl Jul 27 '11 at 13:16